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Abstract. Recent work has demonstrated how to calculate conditional 
mediated effects for mediation models with zero-inflated count outcomes 
in a non-causal framework (O'Rourke & Vazquez, 2019); however, those 
formulas do not distinguish between logistic and count portions of the 
data distribution when calculating mediated effects separately for zeroes 
and counts. When calculating conditional mediated effects for the counts 
in a zero-inflated count outcome Y, the b path should use the partial 
derivative of the log-linear regression equation for X and M predicting 
Y. When calculating conditional mediated effects for the zeroes, the b 
path should use the partial derivative of the logistic regression equation 
for X and M predicting Y instead of the log-linear equation. This paper 
presents adjustments to the analytical formulas of conditional mediated 
effects for mediation with zero-inflated count outcomes when zeroes and 
counts are differentially predicted. Using a Monte Carlo simulation, we 
also empirically show that these adjustments produce different results 
than when the distributional form of zeroes is ignored. 


Keywords: Mediation analysis - Count outcomes - Zero-inflation - ZIP - 
ZINB - Hurdle models 


1 Introduction 


Many theories in the social and behavioral sciences specify indirect mechanisms 
by which predictors influence outcomes. These mechanisms, also known as medi- 
ators, are incorporated into such theories through the use of mediation models. 
Mediation models are widely applied to theories of human behavior and test the 
indirect influence of a predictor variable (X) on an outcome (Y) via a mediator 
(M). Much methodological research on the mediation model has focused on mod- 
els where the endogenous variables M and Y are continuously distributed and 
assume linear associations, and several extensions have been proposed as well for 
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models where M and Y are categorical (i.e., binary or count) variables that are 
modeled with logistic or other exponential family distributions (Coxe & MacK- 
innon, 2010; Gilula, 2012; Iacobucci, 2012; Imai, Keele, & Tingley, 2010; Mack- 
innon, 2008; MacKinnon & Cox, 2012; Mackinnon & Dwyer, 1993; Preacher, 
2015; Valeri & VanderWeele, 2013; VanderWeele, Zhang, & Lim, 2016). How- 
ever, these methods are not appropriate for use where categorical endogenous 
variables contain zero-inflation. 


Zero-inflation occurs when the proportion of observations with a value of 
zero on a particular variable is larger than what is expected from the variable’s 
typical zero-uninflated distribution (for example, Poisson or negative binomial if 
a variable is a measure of counts). Zero-inflated (ZI) count variables are common 
in the social sciences. For example, consider a study of externalizing behaviors in 
middle school; for a given count variable measuring bullying as “number of times 
child was a bully in the past month”, many students would have a score of zero 
because most children do not engage in bullying behaviors. Another example 
from health intervention research would be measuring drinking outcomes in a 
study designed to help adults with alcohol use disorder quit drinking. For a 
drinking count variable measured as “number of drinks consumed in the past 
week”, many participants would have a score of zero because they are actively 
trying to refrain from drinking. 


The traditional methods cited above for categorical mediation analysis are 
not equipped to handle excess zeroes in the outcome, and using these models to 
fit data with zero-inflated distributions may result in biased estimates. A techni- 
cal body of literature does exist for causal inference methods to assess mediation 
with categorical variables that contain zero-inflation (Cheng et al., 2018; Wang 
& Albert, 2012) but this literature is not as accessible to applied researchers due 
to the complexity of its application. The causal literature differs from the gen- 
eral linear model (GLM)-based mediation literature in that it requires a working 
knowledge of causal inference frameworks that involve formulas for probability, 
and causal methods often require additional sensitivity analyses for a formal test 
of mediation. Furthermore, the effects involved in mediation from the causal in- 
ference literature are defined differently from GLM mediation effects, requiring 
the calculation and interpretation of multiple effects to determine mediation even 
in simple cases. 


Recent work on mediation for ZI counts has applied Geldhof, Anthony, Selig, 
and Mendez-Luck (2018)’s method of calculating mediation effects for count data 
that are conditional upon values of X to mediation models with zero-inflated 
count outcomes using a modeling framework that does not come from causal 
inference (O’Rourke & Vazquez, 2019). In this method, mediation effects are 
calculated separately for zeroes and counts when Y is zero-inflated. However, 
that method does not account for the unique distributional nature of the ze- 
roes, as zeroes are predicted using the binomial logistic model while counts are 
fitted using the log-linear model. This article illustrates a revised formula for 
calculating the mediation effect for the zeroes when zeroes and counts are dif- 
ferentially predicted. We also demonstrate via Monte Carlo simulation that the 
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revised formula produces different results than the original formula that does 
not distinguish between zeroes and counts in a zero-inflated outcome. 


1.1 Mediation 


The simplest single-mediator model is described by two OLS regression equations 
using notation from Mackinnon (2008). 


Y =i +b0M +EP X +e (1) 
M =i +aX +e (2) 


In these equations, X is the predictor, M is the mediator, and Y is the outcome. 
From Equation 1, the influence of M on Y is known as the b parameter, and the 
influence of X on Y controlling for M is known as the c’ parameter (also known 
as the “direct effect”). From Equation 2, the influence of X on M is known as 
the a parameter. The parameters 7; and i are model intercepts and e, and eg 
are model errors. The mediated effect that is the focus of this article is speci- 
fied as the product of the a and b parameters (ab), commonly referred to in the 
mediation literature (and hereafter referred to) as the “mediated effect”. Other 
specifications of the mediated effect and their equalities with respect to count 
outcomes are described elsewhere (Coxe & MacKinnon, 2010; MacKinnon, Lock- 
wood, Brown, Wang, & Hoffman, 2007; Mackinnon, 2008; O’Rourke & Vazquez, 
2019). 

Two common approaches to significance testing in mediation are the causal 
steps (Baron & Kenny, 1986; Judd & Kenny, 1981; MacKinnon, Lockwood, Hoff- 
man, West, & Sheets, 2002) and product of coefficients (Sobel, 1982) approaches. 
The recommended test from the causal steps approach is the Joint Significance 
test, which has the best balance of power and Type I error (MacKinnon et al., 
2002). The Joint Significance test uses individual z— or t—tests of estimates of 
the respective a and b parameters to assess significance: if both tests are sta- 
tistically significant, we can conclude that mediation is present. The product of 
coefficients approach was developed using a derived asymptotic standard error 
(Sobel, 1982) to compute a z—test for the mediated effect ab. More recently, 
it has become common to assess significance of the mediated effect by using 
bootstrapping to create asymmetric confidence intervals for ab (MacKinnon, 
Lockwood, & Williams, 2004). Bootstrapping is used because ab is a product of 
two variables and so it is not normally distributed (Aroian, 1947; Craig, 1936), 
meaning traditional formulas that assume a normal distribution of z produce 
biased confidence intervals for ab. 

Mediation analysis is conducted with count outcomes in a similar manner 
to the approach described above. The difference is that instead of a normal 
distribution of continuous Y (the assumption under linear regression), the count 
outcome Y is assumed to have a Poisson distribution, negative binomial (NB) 
distribution if overdispersed, or beta-binomial distribution if overdispersed with 
a restricted upper bound. If Y is a count outcome, Poisson, NB, or beta-binomial 
regression can be used to fit the model specified in Equation 1, and (assuming 
continuous M) Equation 2 can be assessed as usual with linear regression. 
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1.2 Zero-Inflated Counts 


Each of the models described for count outcomes has a ZI counterpart: the ZI 
Poisson (ZIP), ZI negative binominal (ZINB), or ZI beta-binomial (ZIBB) mod- 
els. These are known as zero-inflated generalized linear models (ZI-GZLMs). 
Hurdle models, similar to ZI-GZLMs, can also be used to model ZI count out- 
comes. However, zeroes are treated differently in hurdle models compared with 
ZI models. ZI-GZLMs assume that there are two kinds of zeroes: “structural” 
(i.e., excess) zeroes that will never take on another value, and “sampling” zeroes 
that have some potential to be non-zero. (For the remainder of this paper, when 
we refer to the zeroes in Y, we are referring to the structural zeroes that are 
modeled separately from the counts and sampling zeroes.) Hurdle models do not 
make the structural vs. sampling zero distinction, but instead assume that all 
zeroes are generated from the same process. In other words, hurdle models treat 
all zeroes as structural excess zeroes that are the only source of overdispersion 
in the data, and these models do not include an additional probability mass 
which distinguishes structural zeroes from counts and sampling zeroes in other 
ZI-GZLMs. 
In ZI-GZLMs, the probability of an occurrence of an excess zero is modeled 
as follows. 
z ~ Bernoulli(z) (3) 


Where m is the probability of observing excess zeroes. Assuming a ZI Poisson 
distribution (the simplest in terms of parameterization because mean and vari- 
ance are assumed to be equal), and where A is the mean count from the Poisson 
distribution, the probability mass function of a ZIP model is as follows. 


P(Y =0) =7 + (1 — r)e™^ (4) 
AY 
P(Y #0) =i (5) 


1.3 Mediation for Zero-Inflated Counts 


One recent non-causal method of assessing mediation for count outcomes sug- 
gested computing multiple conditional mediated effects for chosen values of the 
predictor X (Geldhof et al., 2018), representing the nonlinear relation between 
X and Y as several conditionally linear relations that differ across values of 
X (Stolzenberg, 1980). This method was extended to mediation models for ZI 
count outcomes (O’Rourke & Vazquez, 2019) by calculating two separate sets 
of conditional mediated effects: one for the zeroes and one for the counts. For 
a model with any measurement level of X, continuous M, and ZI count Y, b 
paths were calculated separately for structural zeroes and counts using the first 
partial derivative with respect to M of the loglinear mediation regression equa- 
tion shown in Equation 1. This loglinear mediation regression equation is given 


below. 
Y Z e1 +bM+c' X (6) 
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The first partial derivative of Equation 6 being the following. 


oY , 
bL = T = bier 2) (7) 


The formula in Equation 7 for bzz (b path from the loglinear equation) was 
used in conjunction with the a path from Equation 2 to calculate conditional 
mediated effects as follows. 

a x bry (8) 


Two sets of k conditional mediated effects were calculated separately for zeroes 
and counts using the formula in Equation 8, with k being equal to the number 
of chosen values of X (this number would typically be k = 2 for binary X, k = 3 
for continuous X at low, medium. and high values) and M fixed at its mean. 
Equation 8 was used to calculate sets of conditional mediated effects for both 
ZIP and ZINB models, as both Poisson and negative binomial models utilize log 
link functions. 


1.4 Distributional Form for Zeroes 


The method described above produced the desired sets of conditional mediated 
effects, however, using the partial derivative formula brz for both the structural 
zeroes and the counts disregarded the form of the assumed distribution for the 
structural zeroes. Specifically, the structural zeroes are modeled with a logistic 
distribution. Therefore, the logistic regression for predicting structural zeroes 


has a logit link function of 
T 


In( 


) (9) 


The mean function corresponding to this logit link function is as follows. 


l-r 


n ei1+tbM+c' X 
Y= eu tbM+eX 4 1 (10) 
Taking the first partial derivative of Equation 10 with respect to M gives us bre 


(the b path from the logistic mediation regression equation). 


oY i +bM +c! X 
bre = — = b aa (11) 
OM (eiL t6M +e xX + 1)2 
This would result in a conditional mediated effect for the zeroes of 
a * bia (12) 


Both of the first partial derivative formulas for b presented here are known 
quantities for estimating a mediation path with either a count or binary non-ZI 
endogenous variable (Geldhof et al., 2018; Li, Schneider, & Bennett, 2007), but 
they have not been used in tandem to handle two-part mediation models for ZI 
endogenous variables. 
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The current paper aims to utilize both of these formulas for the b path, and 
to demonstrate that using the b path from the logit link function bza when 
calculating mediated effects for the zeroes results in different estimates of the 
conditional mediated effects than using the b path from the log link function 
bLL for both zeroes and counts. In the next section, we describe a Monte Carlo 
simulation study that demonstrates that the formulas produce different esti- 
mates of the conditional mediated effects for the zeroes (hereafter referred to as 
“conditional mediated effects” ). 


2 Simulation Study 


2.1 Simulation Conditions 


We conducted a Monte Carlo simulation in R 4.2.3 in conjunction with Mplus 
version 8.10 (Muthén & Muthén, 2017). In this simulation study, we manipu- 
lated two factors: Sample size (N = 100, 250, 500, 750, and 1500) and population 
distribution of the counts in the outcome (Poisson vs. negative binomial). Simu- 
lation manipulations resulted in 2 x 5 = 10 conditions. Sample sizes were chosen 
to represent a range from small to large samples based on sample sizes commonly 
observed in the behavioral sciences. Manipulation of sample size allowed for us 
to examine possible effects of sample size on results by examining whether the 
difference in estimates of the conditional mediated effect grew smaller as sample 
size increased. The Poisson and negative binomial distributions for counts were 
chosen as the two distributions that are most commonly observed and practically 
applied with ZI-GzLMs. Differences in estimates of the conditional mediated ef- 
fects were expected to be stable across the two levels of distribution of the count 
outcomes. 

Population parameter values were not varied over conditions, and the param- 
eter values used in each simulation model are given in Table 1. 


Table 1. Simulation Study Parameter Values 


Parameter Population Value 
a 0.59 
b (Zeroes) -0.14 
b (Counts) 0.14 
c (Zeroes) -0.01 
cœ (Counts) 0.01 
UM 0 
OM 1 
uy (Zeroes) 0 
uy (Counts) 3 
o* 1 


*for ZINB models only 
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Parameter value magnitudes were chosen to reflect large (0.59) effect size for 
a and small (0.14) effect sizes for b as established in prior simulation research 
on single mediator models (Fritz & MacKinnon, 2007; MacKinnon et al., 2002; 
O’Rourke & MacKinnon, 2015), and parameter value signs were chosen such 
that conditional mediated effects would differ in sign for the zeroes and counts, 
as discussed below. The c’ path was assigned a very small effect size in accor- 
dance with a model that would approach full mediation, a condition where c’ = 
0 (Mackinnon, 2008), but still would factor into calculations of conditional me- 
diated effects. Table 2 shows population calculations of the conditional mediated 
effects across values of X based on the parameter values given in Table 1, using 
both the log link and logit link b paths. 


Table 2. Calculation of Population Conditional Mediated Effects for Log Link and 
Logit Link Formulas 


Log Link Function b path 


X=0 X=1 


General Formula a x b(eit tM +e"X) a x b(e't tM +e X) 
ei tbM+e!X e0+(—0.-14)(0)-+(—0.01)(0) _ 4 g0+(—90.14)(0)+(—0.01)(1) _ 1.01 
b(et1teM+c' X) —0.14 x 1 = —0.14 —0.14 * 1.01 = —0.141 
ae belt tote) 0.59 x —0.14 = —0.0826 0.59 * —0.141 = —0.0834 
Conditional Mediated Effect -0.0826 -0.0834 
Logit Link Function b path 
X=0 X=1 

ei FOMI X eil toM+c X 
General Formula a * b TEMIA yaya a* b TMII ae 
ei +bM+c'X et (—0.14)(0)+(—0.01)(0) — 4 e0+(=0.14)(0)+(—0.0)0) — 1.01 

eit +oM+c' X 1 = 1.01 = 
T FMFI X ary = 0.25 joins T 0-2499 
o —0.14 x 0.25 = —0.035 —0.14 * 0.2499 = —0.0349 


(et1 tbM+c' X }1)2 
ett toM+c' X 


oS 0.59 x —0.035 = —0.0207 0.59» —0.0349 = —0.0206 
(et FOM+e7X 442 
Conditional Mediated Effect -0.0207 -0.0206 


axb 


2.2 Data Generation and Data Analysis 


The R MplusAutomation package (Hallquist & Wiley, 2018) was used to simu- 
late data. For each of the conditions, 500 replications with complete data were 
simulated. The paths related to mediation (b and c’) were specified to be equal 
in magnitude but opposite in sign for the zeroes and counts in Y in accordance 
with commonly observed patterns of results in applied ZI-GZLMs. Binary X 
was simulated with a Bernoulli distribution with X € 0,1, and M was simulated 
with a continuous Gaussian distribution M ~ N(0,1). The counts in Y were 
simulated to have a mean of 3 and the zeroes in Y, a mean of 0. Replications for 
Y with a ZIP distribution did not include a dispersion parameter, and when Y 
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was simulated to have a ZINB distribution, the dispersion parameter was ¢ = 1 
as specified by the variance of Y in the Mplus MODEL command. 

After all datasets were generated, the R MplusAutomation package was then 
used to create and run Mplus scripts analyzing all replications within each con- 
dition and then import results into R. This process was repeated for each of 
the 10 conditions. For each replication, a ZI-GZLM was fitted to the data using 
Maximum Likelihood estimation. Conditional mediated effects were calculated 
at X = 0 and X = 1 using the Mplus “Model Constraint” command. For the 
zeroes in Y, sets of conditional mediated effects were calculated using both the 
original method with bzz for the b path and the revised method with bzg for 
the b path. This resulted in four conditional mediated effects for comparison in 
further analyses. Bootstrapped confidence intervals were also generated to assess 
significance of each conditional mediated effect. Sample Mplus and R scripts can 
be found on the GitHub project at https: //github.com/horourke/MZI12. 


2.3 Simulation Study Outcomes and Outcome Analyses 


We assessed differences in results for the conditional mediated effect estimates by 
examining relative parameter difference (i.e., relative bias for the abzg estimate). 
The relative difference was calculated as the difference between the population 
value of abra and the respective estimates aby, and abza, over the population 
value of abra. pa 

abra — abr, (13) 
abia 


abra = abie (14) 
ab LG 

Efficiency was calculated using the standard deviations of the raw estimates 
of the conditional mediated effects averaged across each condition. We also ex- 
amined statistical power for each condition, calculated as the proportion of repli- 
cations for which the p value associated with each conditional mediated effect 
was less than .05 and the bootstrapped confidence intervals of each conditional 
mediated effect did not include zero. 

In preparation for analysis of the relative difference outcome, data were re- 
structured to long format such that use of the b formula (bp, vs. bic) could 
be coded as an additional binary predictor of a given outcome. Analyses con- 
ducted in R examined the impact of the condition on the dependent variable 
of interest at the replication level, with one replication considered as one obser- 
vation. Analysis of variance (ANOVA) was used to investigate the differences 
in study conditions for relative parameter differences. Analyses were conducted 
separately for each outcome at X = 0 and X = 1. Factors representing study con- 
ditions in each ANOVA were sample size, population distribution of the counts 
in the outcome, and method of calculating the b path. In addition to main ef- 
fects, all possible two- and three-way interactions were included as predictors 
in each ANOVA. Only ANOVA estimates that were significant at p < .05 with 
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corresponding partial 7? values of .02 (small amount of variance explained) or 
higher were considered meaningfully significant for the interpretation of results. 


3 Results 


3.1 Relative Difference 


The average relative difference over replications for each condition is shown in 
Table 3. Results from the ANOVAs for both X = 0 and X = 1 indicated that only 
the method of calculating the b path (bza vs. bL) was a meaningfully significant 
predictor of relative difference. Method of calculating the b path explained 20.6% 
and 15.1% of the variability in the outcome respectively, which were large effect 
sizes (X = 0: p < .001, partial 7? = .206; X = 1: p < .001, partial 7? = .151). 


Table 3. Relative Difference of Conditional Mediated Effects Collapsed Across Con- 
ditions 


ZINB 
X=0 X=1 
nr abit abLG abLL abra 
100 2.917 -0.052 4.620 -0.101 
250 3.226 0.017 3.857 -0.005 
500 3.089 0.011 3.330 0.000 
750 3.062 0.007 3.176 0.000 
1500 3.065 0.009 3.074 0.006 


ZIP 
X=0 X=1 
nr abit abLG abLL abra 
100 3.105 -0.001 4.599 -0.049 


250 3.243 0.029 3.765 0.011 
500 3.089 0.012 3.282 0.003 
750 3.076 0.012 3.169 0.007 
1500 3.066 0.012 3.073 0.010 


Examining Table 3 for X = 0, the average relative difference was around 3 
or above for all conditions for abr; estimates. When using the abza formula 
to calculate conditional mediated effects, the average relative difference only 
reached an absolute value above .05 for the ZINB model at the smallest sample 
size, and the average relative difference was otherwise extremely small for all 
conditions using the abrg formula. 

For X = 1, results from Table 3 indicate that the average relative difference of 
the estimates of abi, ranged from [3.073, 4.620] for all conditions, with average 
relative difference decreasing as sample size increased’. As with calculations for 


1 Sample size was a statistically significant predictor of relative difference for the 
ANOVA where X = 1, however the partial 7? < .02. 
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X = 0, the average relative difference of the conditional mediated effects using 
the abra formula only reached an absolute value above .05 for the condition 
fitting the ZINB model at the smallest sample size, and the average relative 
difference was otherwise not problematic (i.e., below an absolute value of .05) 
for all conditions using the abrg formula. 


3.2 Efficiency 


For all values of X and regardless of sample size and distribution of outcomes, 
estimates of abrg had smaller variability (i.e., were more efficient) than for esti- 
mates of aby, as shown by the averaged standard deviations of the estimates in 
Table 4. The difference in efficiency between the two sets of conditional mediated 
effect estimates decreased monotonically as sample size increased such that the 
conditional mediated effect estimates calculated with aby; were least efficient at 
the smallest sample sizes. 


Table 4. Efficiency of Conditional Mediated Effects Collapsed Across Conditions 


ZINB 

X=0 X 

n abit abia abrir abra 
100 0.152 0.036 0.242 0.033 
250 0.090 0.021 0.119 0.021 
500 0.061 0.015 0.073 0.015 
750 0.047 0.012 0.053 0.011 
1500 0.034 0.008 0.037 0.008 

ZIP 


II 
= 


n abLL abra abLL abLG 
100 0.147 0.034 0.221 0.032 
250 0.086 0.021 0.109 0.020 
500 0.059 0.014 0.069 0.014 
750 0.045 0.011 0.050 0.011 
1500 0.033 0.008 0.035 0.008 


3.3 Power 


Power values by condition can be found in Table 5. For conditional mediated 
effects where X = 0, there were negligible differences in power between the 
methods of calculating the b path. For conditional mediated effects where X = 
1, power was slightly larger for estimates of abzg than for estimates of abzz. 
Power increased as the sample size increased for all conditions, and there were 
negligible differences in power between the ZIP and ZINB conditions. Power 
never reached a level of .8 (Cohen, 1988) in any of the conditions, likely due to 
the small magnitude of the b paths. 
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Table 5. Power of Conditional Mediated Effects Collapsed Across Conditions 


ZINB 

X=0 X 

n abLL aba abLL abLG 
100 0.000 0.008 0.000 0.018 
250 0.068 0.106 0.000 0.118 
500 0.214 0.250 0.050 0.258 
750 0.400 0.418 0.222 0.420 
1500 0.692 0.700 0.612 0.702 

ZIP 


II 
= 


n abit abia abtL aba 
100 0.000 0.020 0.000 0.032 
250 0.090 0.120 0.000 0.134 
500 0.274 0.298 0.084 0.308 
750 0.432 0.442 0.286 0.446 
1500 0.728 0.730 0.670 0.732 


4 Discussion 


We used a Monte Carlo simulation to demonstrate the differences in results using 
log-linear vs. logistic regression equations for zeroes when calculating conditional 
mediated effects in a mediation model where Y is a ZI count. The conditional 
mediated effects for the zeroes in Y were calculated using two different b path 
formulas, br, and bra, where brga used the distributional form of the zeroes. 
Comparing estimates of abr; and abza, we found that the conditional mediated 
effects differed significantly in magnitude at both values of X, for both ZINB 
and ZIP models and across all sample sizes examined. Specifically, results for 
relative difference showed that estimates of abr, were significantly different from 
estimates of abra, and that this difference held across sample sizes and outcome 
distributions. Conditional mediated effects for zeroes calculated using the bre 
formula were also more efficient, and when X was non-zero, had slightly higher 
power. These results indicate that the choice of b path formula has a meaningful 
impact on the interpretation of results for the conditional mediated effects and 
should be considered when using this method to conduct mediation analysis with 
ZI count variables. 


For conditional mediated effects calculated at non-zero X, power was slightly 
higher for abra than for abzz. This means that when using the different for- 
mulas for b, we may make different conclusions about the significance of the 
conditional mediated effects when X is non-zero (for example, we could observe 
that the conditional mediated effect aby 7; 1) was not significant and abzG{x=1] 
was significant), which further highlights the importance of considering the dis- 
tributional form of the zeroes when conducting mediation analysis where ZI 
counts are present in the data. 
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In this paper, we focused specifically on the mediation model that contained 
Y as a ZI count, meaning we discussed the issue of separate distributions of 
zeroes and counts with respect to only the b path (from M to Y) in mediation. 
However, this issue is applicable as well to models with a ZI count mediator, in 
which case we would calculate an a path using a log-linear regression equation 
for counts in M and an a path using a logistic regression equation for zeroes in 
M. Furthermore, it would be possible to use this method in a model where both 
M and Y are ZI counts. Under such circumstances, the a path for the counts 
in M could be calculated using the first partial derivative with respect to X of 
the log-linear transformation of Equation 2, and the b path for the counts in Y 
could be calculated using Equation 7. The a and b paths for the zeroes in M 
and Y would then be calculated using the first partial derivatives of the logit 
transformations of their respective regression equations (Equation 11 for the b 
path). 

The utilization of these formulas can also be applied to future methodolog- 
ical work on mediation analysis with ZI count variables. The method described 
in this paper that is an extension of O’Rourke and Vazquez (2019) can be ex- 
tended to more complex models that are frequently used by applied researchers, 
such as models that incorporate time (i.e., longitudinal models). This process of 
mediation can be expanded to ZI-GZLMs for repeated measures nested within 
individuals that are fitted in the multilevel modeling framework. 

It is important for researchers to have accessible methods of assessing me- 
diation in complex nonlinear models. This paper advances accessible methodol- 
ogy in the pursuit of best practices for investigating mediators when data are 
nonnormal. The simulation results presented here highlight the complexity of 
calculating mediated effects in models where ZI counts are present. 
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Abstract. Machine learning methods are being increasingly adopted in behav- 
ioral research. Lasso regression performs variable selection and regularization, 
and is particularly appealing to behavioral researchers because of its connection 
to linear regression. Researchers may expect properties of linear regression to 
translate to lasso, but we demonstrate that this assumption is problematic for mod- 
els with categorical predictors. Specifically, we demonstrate that while the coding 
strategy used for categorical predictors does not impact the performance of linear 
regression, it does impact lasso’s performance. Group lasso is an alternative to 
lasso for models with categorical predictors. We investigate the discrepancy be- 
tween lasso and group lasso models using a real data set: lasso performs different 
variable selection and has different prediction accuracy depending on the coding 
strategy, while group lasso performs consistent variable selection but has different 
prediction accuracy. Using a Monte Carlo simulation, we demonstrate a specific 
case where group lasso tends to include many variables when few are needed, 
leading to overfitting. We conclude with recommended solutions to this issue and 
future directions of exploration to improve the implementation of machine learn- 
ing approaches in behavioral science. This project shows that when using lasso 
and group lasso with categorical predictors, the choice of coding strategy should 
not be ignored. 


Keywords: Lasso regression - Categorical predictors - Regularization 


1 Introduction 


Many behavioral research questions involve categorical predictors, including edu- 
cation, ethnicity, religion, gender, or experimental conditions. Unlike numerical predic- 
tors, which typically have a natural scale, to be included in statistical models categorical 
predictors require researchers to select a method for encoding these variables (i.e., rep- 
resenting the categories using a numeric system). Thus, a single categorical predictor 
can be represented in a model using different sets of variables, each set embodying the 
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same predictor but representing different contrasts of the categories. This special prop- 
erty of categorical predictors motivates our exploration of categorical predictors in the 
case of linear regression and two machine learning algorithms: least absolute shrinkage 
and selection operator (lasso; Tibshirani, 1996) and group lasso regression (Yuan & Lin, 
2006). We explore both variable selection and prediction accuracy for these models and 
how they are impacted by using different coding strategies for categorical predictors 
using a real-world data set. 

We use a data set focusing on stress during COVID-19 as the primary outcome, 
measured in over 100,000 participants (Yamada et al., 2021). The stress score is an 
aggregated score from the Perceived Stress Scale (PSS-10) on a 1-5 scale. The data 
set includes categorical predictors, such as Education, Gender and Marital Status, and 
continuous predictors, such as Age and Trust in the Country. The overall goal is to 
predict participant’s Stress using the available predictors. 

In the remainder of this section, we introduce the three analytical approaches exam- 
ined in this paper: linear regression, lasso regression, and group lasso regression. We 
focus on the application of these methods with a continuous outcome and one or more 
categorical predictors. After introducing these methods, we demonstrate their use with 
the applied example, exploring peculiar behavior of the machine learning approaches 
that does not occur with linear regression. 


1.1 Linear Regression With Categorical Predictors 


Categorical predictors need to be encoded into a set of variables to be included 
in regression models. Different coding strategies can be implemented, such as dummy, 
contrast, sequential, or Helmert coding. Tables 1—4 show different ways to encode a cat- 
egorical variable, Education, with 7 categories (no education, up to 6 years of school, 
up to 9 years of school, up to 12 years of school, some college or equivalent, college 
degree, PhD/doctorate). Dummy coding uses only 0’s and 1’s to indicate category mem- 
bership. One category is selected as the reference category (or reference group) and is 
assigned a score of 0 on all indicators. For other categories, only the indicator corre- 
sponding to the category is coded as 1 and all other indicators are set to 0 (Table 1). 
Contrast coding is similar to dummy coding, but the reference category which is coded 
as all 0 in dummy coding is now coded with all -1 instead, changing the interpreta- 
tion of the intercept and slope coefficients (Table 2). Sequential coding compares each 
category to the previous category (Table 3), while Helmert coding examines how each 
category is compared to the average of all subsequent categories (Table 4). Note that 
if a categorical variable has k categories, k — 1 indicators are needed, regardless of the 
coding strategies used. This type of design matrix is defined as nonsingular because 
the matrix is invertible. The design matrix has to be nonsingular for linear regression 
but this is not necessarily the case for lasso or group lasso. In Appendix B we discuss 
singular matrix options for lasso regression. 

In linear regression, each coding scheme represents categories using a different nu- 
merical system, which leads to different interpretations of their coefficients. However, 
each coding scheme always predicts the category mean for each category (or adjusted 
means if covariates are included), and the explained variance is the same regardless of 
coding choice (Darlington & Hayes, 2016). Therefore, researchers can choose coding 
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Table 1: Dummy Coding 


Education Dı Dz D3 D4 Ds D6 
1. no education 0 0 0 0 0 0 
2. up to 6 years of school 1 0 0 0 0 0 
3. up to 9 years of school 0 1 0 0 0 0 
4. up to 12 years of school 0 0 1 0 0 0 
5. some college or equivalent 0 0 0 1 0 0 
6. college degree 0 0 0 0 1 0 
7. PhD/doctorate 0 0 0 0 0 1 


Note. No education is selected as the reference group (coded 0 on all indicators) and every other 
category scores | on a single indicator and 0 on all other indicators. 


Table 2: Contrast Coding 


Education Ci C2 C3 C4 C5 C6 
1. no education 1 0 0 0 0 0 
2. up to 6 years of school 0 1 0 0 0 0 
3. up to 9 years of school 0 0 1 0 0 0 
4. up to 12 years of school 0 0 0 1 0 0 
5. some college or equivalent 0 0 0 0 1 0 
6. college degree 0 0 0 0 0 1 
7. PhD/doctorate -1 -1 -1 -1 -1 -1 


Note. PhD/doctorate is selected as the omitted category (coded -1 on all indicators) and every 
other category scores | on a single indicator and 0 on all other indicators. 


strategies among all these options according to their needs without concern about model 
performance. Dummy and contrast coding are often used for nominal categorical vari- 
ables, while sequential and Helmert coding are particularly helpful when categories are 
ordered. 

When using different coding strategies, the regression coefficients have different 
interpretations. For example, a researcher might want to know whether Stress during 
the COVID-19 pandemic can be predicted by Education. The seven categories within 
the variable Education are encoded by 6 indicators. Linear regression fits the following 
model: 

Y; = Bo + BiX1i + B2X2i + B3X3i + BaXai + BsX5; + BoXoi + £i, (1) 


where Y; is the outcome value for the i” observation (person), X ji is the j!” variable 


to convey category membership for the i” observation, and g; is the error term for the 
i” observation. Equation 1 is the general equation for all coding strategies. If different 
coding strategies are used, the intercept Bo and coefficients for different indicators, By 
through p6, have different meanings. For example, suppose the fitted linear regression 


model (with Y; representing the predicted value for the i” observation) is 


¥; = 2 + 0.3X1; + 1.5Xo; + 0.2X3; + 0.5X4; — 0.2X5; — 0.4X6;. (2) 


The interpretation of these coefficients would depend on which coding strategy was 
used. If dummy coding was used with no education as the reference group (as in Table 
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Table 3: Sequential Coding 

Education S 
. no education 0 
. up to 6 years of school 1 
. up to 9 years of school 1 
1 

1 

1 


i 
K 
A 
D 


. up to 12 years of school 
. some college or equivalent 
. college degree 

7. PhD/doctorate 1 
Note. The lowest category scores 0 on all indicators. Each subsequent category scores 1 on one 
more indicator than the previous. 


NN WN KE 
ePeePrro am 


ePrereProco adm 


Table 4: Helmert Coding 


Education A Ab A H4 Hs Ho 
1. no education -6/7 0 0 0 0 0 
2. up to 6 years of school 1/7 -5/6 0 0 0 0 
3. up to 9 years of school 1/7 1/6 -4/5 0 0 0 
4. up to 12 years of school 1/7 1/6 1/5 -3/4 0 0 
5. some college or equivalent 1/7 1/6 1/5 1/4 -2/3 0 
6. college degree 1/7 1/6 1/5 1/4 1/3 -1/2 
7. PhD/doctorate 1/7 1/6 1/5 1/4 1/3 1/2 


Note. The lowest indicator scores — (k — 1)/k on the first indicator and 0 on all subsequent 
indicators. The next highest scores 1/k on the first indicator, — (k — 2)/(k — 1) on the second 
indicator, and 0 on all subsequent indicators. The next highest scores 1 /k on the first indicator, 
1/(k— 1) on the second indicator, — (k — 3)/(k — 2) on the third indicator, and 0 on all 
subsequent indicators. And so on. 


1), we would interpret the coefficient for X4, 0.5, as the difference between the average 
stress score of individuals with no education and the average stress score with some 
college education. However, if contrast coding was used (as in Table 2), 0.5 would in- 
dicate the difference between the average stress score of individuals with up to 12 years 
of school and the average score of all categories. If sequential coding was used (as in 
Table 3), 0.5 would be interpreted as the difference between the average stress score of 
individuals with some college education and the average stress score of individuals with 
up to 12 years of school. If Helmert coding was used (as in Table 4), 0.5 would indicate 
that on average individuals with up to 12 years of school are 0.5 points less stressed 
than the average of those who have some college education, those who have a college 
degree and those who have a PhD/Doctorate. The interpretations of the coefficients are 
inseparable from the coding strategy used. 

Different selections of reference categories in dummy and contrast coding and or- 
dering of categories in Helmert and sequential coding can also produce coefficients with 
different meanings. For example, if no education is the reference category for dummy 
coding, Bo represents the average stress score for people with no education and fı 
through Be will represent the difference between no education and the corresponding 
coded category. On the other hand, if up to 6 years of school is the reference cate- 
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gory, Bo represents the average stress for individuals with up to 6 years of school, and 
Pı through pe will represent the difference between “up to 6 years of school” and the 
corresponding coded category. 

Though different ways to code categorical variables produce different model coef- 
ficients, they do not affect the predictions/prediction accuracy of linear regression. To 
demonstrate that linear regression with a categorical predictor will predict the same cat- 
egory means for each coding scheme, we used Education to predict Stress. We randomly 
sampled 10,000 participants from the COVID-19 Stress Data (Yamada et al., 2021) to 
serve as our sample data set, and then we randomly split our sample into training (80%) 
and test (20%) data. Next, we fit linear regression on the training data set with four 
different coding strategies from Tables 1 - 4 applied to the variable Education. Table 5 
contains the model coefficients. 


Table 5: Linear Regression Example for Coding 


Coefficient Dummy Contrast Sequential Helmert 
Bo 2.852 2.955 2.852 2.955 
By 0.031 -0.103 0.031 0.121 
Bo 0.110 -0.072 0.079 0.107 
Bs 0.145 0.007 0.035 0.036 
By 0.161 0.041 0.016 0.001 
Bs 0.138 0.058 -0.023 -0.022 
Be 0.139 0.035 0.001 0.001 


Note. Each column of the table represents one coding strategy and rows represent the 
coefficients of the indicator X; for each coding strategy. 


Using the values of X;—X6 from Table 1—4 and the coefficient estimates from Table 
5, we reconstruct the predicted score (i.e., category mean) for the “some college or 
equivalent” category for dummy, contrast, sequential, and Helmert coding respectively. 


2.852 + 0.031(0) +0.110(0) +.0.145(0) +0.161(1) +0.138(0) +0.139(0) =3.013 (Dummy 
2.955 — 0.103(0) — 0.072(0) + 0.007 (0) + 0.041 (0) +.0.058(1) +0.035(0) =3.013 (Contrast 
2.852 +.0.031(1) +0.079(1) +0.035(1) +0.016(1) — 0.023(0) +.0.001(0) = 3.013 (Sequential 

1 1 1 1 2 
2.955 +0.121 (7) +0.107( =) +0.036( <) +0.001 (7) —0.022(— Z) +0.001(0) =3.013 (Helmert 


The predicted score for “some college or equivalent” using dummy coding is the same 
as that for contrast, sequential, and Helmert coding. Following a similar procedure, 
it can be shown that all predicted scores match the category means for each coding 
strategy (Cohen, Cohen, West, & Aiken, 2003; Darlington & Hayes, 2016). 

Since predicted scores are the same across coding strategies in linear regression, this 
means prediction accuracy is also the same across the different coding strategies. In our 
example data, prediction accuracy quantifies how far a model’s predicted stress scores 
are from the observed stress scores of participants in the test data. We use Mean Squared 


) 
) 
) 
) 
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Error (MSE) to measure the prediction accuracy. Mathematically, MSE is calculated as 
MSE = —Y (Y;—Y¥;,)’, 3 
- LI ) (3) 


where n represents the number of observations in the test data; Y; represents the ob- 
served outcome value of the i” observation in the test data; and Y; represents the pre- 
dicted outcome value of the i” observation from the model (which is generated using 
the training data). When we calculate the MSE of four linear regression models each 
fit using one of the four coding strategies mentioned previously, we find that all models 
have the exact same MSE of 0.13674. This illustrates that prediction accuracy is not 
affected by coding strategy when using linear regression. 

While these results may seem trivial and require only a basic understanding of linear 
regression to understand, they stand in stark contrast to similar results we will examine 
in alternative regularized regression approaches. In summary, linear regression models 
with different coding strategies predict the same scores (i.e., category means) and give 
the same prediction accuracy, though they produce different coefficients. These proper- 
ties persist when there are additional predictors (categorical and/or continuous) in the 
model, where the predicted scores (which are now adjusted means) are the same for all 
coding strategies, and thus prediction accuracy is always the same as well. 


1.2 Lasso and Group Lasso Regression 


In contrast to linear regression, lasso regression is useful when the proposed model 
involves many predictors, but only a few may be true predictors of the outcome (i.e., 
sparsity). Lasso is gaining popularity in behavioral science presumably because it shares 
many properties with linear regression, an already common statistical approach in the 
field (McNeish, 2015). For example, a lasso model fit to the COVID-19 data using 
Education to predict Stress would share the same equation as linear regression given 
in Equation 1. However, the values of the p; coefficients would differ between the 
two methods because linear and lasso regression differ in the way they estimate the 
vector containing these regression coefficients, 8. In linear regression, the estimated 
coefficient vector is calculated as follows, 


Êiinear = en = XB 2), (4) 


where |:|» is the notation for the L2 norm. Lasso, on the other hand, adds a penalty term 
governed by the penalty parameter À to regulate the size of the coefficients: 


Biasso = i a -X| +All), (5) 


where |:|; is the notation for the L1 norm.! When A is nonzero, nonzero values of 
B result in increases in A|B|,, and so Equation 5 reaches its minimum when both the 


' Another alternative to lasso is ridge regression which is expressed by Equation 5 except with 
an L2 norm instead of an L1 norm for the regularization term. In Equation 5, the L1 norm 
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prediction error and the size of the elements of B are considered. A large À value results 
in the coefficients in B being shrunk toward or equal to zero so fewer predictor variables 
are selected in the model (where “selected” means that the coefficient is nonzero in the 
final solution). A small A value, on the other hand, results in less shrinkage so more 
predictor variables can be selected into the model. Linear regression is actually a special 
case of lasso regression when A is set to zero. 

While lasso has many benefits over linear regression (Hastie & Tibshirani, 2018; 
McNeish, 2015; Tibshirani, 1996), when applying lasso regression to models with cat- 
egorical predictors, additional considerations must be made. Lasso regression models 
select variables based on the penalty parameter A and the sizes of the entries in coef- 
ficient vector B. However, as we demonstrated with linear regression, using different 
coding strategies for a categorical predictor creates models with different coefficient 
vectors. This means that the choice of coding strategy may result in different variable 
selection in lasso regression models. The issue of coding strategies is related to the 
issue of variable scaling with continuous predictors, which also influences variable se- 
lection and prediction accuracy in lasso regression models. One common solution to 
this problem is to standardize all continuous predictors before applying lasso regres- 
sion (Marquardt, 1980). In this way, the effect of scaling is excluded from the variable 
selection of lasso regression with continuous predictors. While dichotomous variables 
can be standardized, different coding strategies representing more than two categories 
do not result in the same standardized solution. Given this, there is reason to believe 
that the performance of lasso regression with categorical variables may be impacted by 
the choice of coding strategies for those variables. 

A generalization of lasso regression which may also be impacted by coding strategy — 
but in different ways—is group lasso regression. Group lasso, as opposed to lasso, per- 
forms variable selection by selecting groups of variables rather than individual variables 
(Yuan & Lin, 2006). This is particularly valuable for the case of categorical predictors 
because the set of indicators for each variable forms a natural group. The mathematical 
formula for estimating the coefficient vector f in group lasso is 


G 
Beroup = a L IAR) (6) 
g=l 


where G represents the number of groups of variables, and B;, represents the coefficient 
vector of that corresponding group. Other notation is the same as Equation 5. Using 
the L2 norm within each group g is what allows group lasso to either select all or 
none of the variables within each group. Also, multiplying by À after summing the L2 
norms of all groups penalizes each group instead of each individual indicator variable. 
These differences provide group lasso with distinct properties: When all variables are 
considered one group, group lasso performs as ridge regression. On the other hand, 
when all the variables are their own group, group lasso performs as lasso regression. 


penalizes the absolute value of the coefficients, used by lasso; while in ridge regression, the 
L2 norm penalizes the squares of all coefficients. Given this property, ridge regression is not 
as effective at penalizing parameters to zero compared to lasso regression (Tibshirani, 1996). 
Therefore, lasso regression is preferred for variable selection. 
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The advantage of group lasso is that when there are multiple groups of more than one 
variable, the result is a combination of within-group ridge regression and across-group 
lasso regression. 

The group lasso has special properties with respect to variable selection. Within a 
group, group lasso typically includes or excludes all variables because of the within- 
group ridge regression. Given its unique properties with respect to variable selection, 
group lasso has been recommended as a useful alternative to lasso regression when 
dealing with models with categorical variables (Detmer, Cebral, & Slawski, 2020; Mc- 
Neish, 2015); however, no prior research has explored the sensitivity of group lasso to 
different coding strategies. In group lasso, all indicators for a categorical variable are 
defined as a group, and the algorithm should either include all indicators associated 
with one categorical predictor or exclude all these indicators. 


1.3 Motivation 


With the increasing use of lasso techniques across scientific fields, but especially 
within the social and behavioral sciences, many researchers rely on their intuitions 
about the similarities between lasso and linear regression to understand, use, and in- 
terpret the results of lasso regression. This could be particularly problematic for models 
with categorical predictors. Prediction accuracy in linear regression is unaffected by the 
selection of coding strategy; however, lasso regression conducts regularization by min- 
imizing regression coefficients, which differ across coding strategies. This may lead to 
different prediction accuracy and variable selection depending on the coding strategy 
used when using lasso. Since group lasso treats the variables in a group as a whole set, 
it seems less likely that its variable selection will be impacted by the choice of coding 
strategy. However, the prediction accuracy of group lasso may still be impacted by the 
coding strategy. 

To explore the potential impacts of coding strategy on important characteristics of 
lasso and group lasso regression, we combine both real data analysis and simulation. 
First, using the COVID stress data set described previously, we demonstrate the use 
of lasso and group lasso regression with categorical variables, where different coding 
strategies of categorical variables impact two aspects of model performance: variable 
selection and prediction accuracy. Next, we use a Monte Carlo simulation to demon- 
strate a specific case where group lasso may tend to overfit the training data. In the last 
section, we explore other potential solutions, important future directions, and general 
conclusions. 


2 Real Data Analysis with COVID Stress Data 


We used the COVID stress data set with the same sample of 10,000 participants 
and the same training/test data sets used in Section 1.1 to explore how coding strategies 
affect models estimated by lasso and group lasso. In the models, we included six cat- 
egorical predictors (where a predictor with k categories was represented by k — 1 indi- 
cator variables): Education (7 categories), Employment status (6 categories), Gender (3 
categories), Isolation status (4 categories), Marital status (4 categories), and Mother’s 
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education (7 categories). We also included seven continuous predictors in the models. 
Thus, after coding all categorical variables and adding the seven continuous variables, 
the models predicted the outcome Stress with 6+5+2+3+3+6+7 = 32 variables. 
In total, we trained eight different models using lasso and group lasso with four coding 
strategies: dummy, contrast, sequential, and Helmert. We used 10-fold cross-validation 
on the training data to select the penalty parameter from the model with the best predic- 
tion accuracy, so the penalty parameter that was selected is different across models with 
different coding strategies.” We then examined if the variable selection and prediction 
accuracy of these lasso and group lasso models were affected by the choice of coding 
strategy. 


2.1 Variable Selection 


We first examined differences in the variable selections of the four lasso models. 
Results are shown in Table 6. Focusing on the Education variable, we illustrate how 
the use of different coding strategies can result in conflicting findings. Both the dummy 
coding model and the sequential coding model have a predictor which represents the 
difference between no education and 6 years of education. After applying lasso, the 
dummy coding model includes this predictor, whereas the sequential coding model 
excludes this predictor. Based on these results, using the dummy coded model, a re- 
searcher might conclude that COVID stress differs across the no education and 6 years 
of education groups, whereas using a sequential coded model, the opposite conclusion 
would be made. 

Fitting similar dummy-, contrast-, sequential-, and Helmert-coded models with group 
lasso, we found that the results differed notably from the traditional lasso. While lasso’s 
variable selection was affected by the choice of coding strategy (see Table 6), the group 
lasso’s variable selection seemed stable across different coding strategies, with all pre- 
dictor variables selected to remain in all four models. Thus, based on the applied data 
analysis, it seems that variable selection is not impacted by the coding strategy for 
group lasso, though this should be subject to additional investigation. This suggests that 
if researchers are interested in using lasso for variable selection and have categorical 
predictors, using group lasso could avoid the arbitrary choice of coding strategy. How- 
ever, group lasso was not successful in reducing the set of potential predictors, and thus, 
it may suffer from a limitation of being overly inclusive. We explore this issue more in 
a simulation. 


2.2 Prediction Accuracy 


In this section, we investigate whether prediction accuracy is affected by the choice 
of coding strategy using both lasso and group lasso. We examined the prediction accu- 
racy in two ways: predicted category scores and MSE of the model applied to the test 
data set. 


2 Note that even with the same penalty parameter, models with different coding strategies or 
reference categories will still have different variable selection and prediction accuracy. 
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Table 6: Variable Selection for Different Coding Strategies by Lasso 


Lasso Regression 


Variable Dummy Contrast 

6 years - no no - Average 

9 years - no 6 years - Average 
picasa 12 years - no © 9years- Average 
some college - no 


PhD - no 


12 years - Average 
some college - Average 


Employment Status 


part-time - no 


student - no 
full-time - no 
retired - no 


no - Average 


self-employed - Average 
student - Average 
full-time - Average 


Sequential 


9 years - 6 years 
12 years - 9 years 
some college - 12 years 
college - some college 
PhD - college 


Helmert 
no - Average(6 years and more) 
6 years - Average(9 years and more) 
| 9 years - Average(12 years and more) — 
12 years - Average(some college, college, PhD) 
some college - Average(college + PhD) 
college - PhD 


part-time - no 
self-employed - part-time 
student - self-employed 
full-time - student 
retired - full-time 


no - Average(part-time, self-employed, student, full-time, retired) 
self-employed - Average(student, full-time, retired) 
student - Average(full-time, retired) 
full-time - retired 


Gender 


man - woman 
other - woman 


woman - Average 
man - Average 


man - woman 
other - man 


woman - Average(man, other) 
man - other 


Isolation Status 


minor changes - usual 
isolated - usual 
medical isolated - usual 


usual - Average 


isolated - Average 


minor changes - usual 
isolated - minor changes 
medical isolated - isolated 


usual - Average(minor changes, isolated, medical isolated) 
minor changes - Average (isolated, medical isolated) 
isolated - medical isolated 


Marital Status 


divorced - single 
married - single 
other - single 


single - Average 
divorced - Average 
married - Average 


divorced - single 
married - divorced 
other - married 


single - Average(divorced, married, other) 
divorced - Average(married, other) 
married - other 


Mom’s Education 


6 years - no 


12 years - no 
some college - no 
college - no 


no - Average 
6 years - Average 
9 years - Average 
12 years - Average 
some college - Average 
college - Average 


6 years - no 
9 years - 6 years 
12 years - 9 years 
some college - 12 years 
college - some college 
PhD - college 


no - Average(6 years and more) 
6 years - Average(9 years and more) 
9 years - Average(12 years and more) 
12 years - Average(some college, college, PhD) 
some college - Average(college, PhD) 
college - PhD 


Note. Variables with a white background color were selected to be in the model, and variables with a grey background color were not selected. 
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Predicted Category Scores We first examined whether the predicted stress score for 
each Education group is the same with different coding strategies in lasso and group 
lasso models. In this section, We generated the predicted score for each category using a 
model with only Education as a predictor, so the models contained 6 indicator variables 
in total. While this model is oversimplified, it eases the direct comparison between the 
true means of each group and the predicted scores. 

The predicted category scores for lasso models fit using the four different coding 
strategies are shown in Table 7, with the final column providing the actual category 
means for Education observed in the training data. First off, it is important to note 
that category scores shown in the table were rounded. Thus, some category scores that 
were very close to the actual category scores were rounded to the same value, but there 
were no lasso models where the predicted scores were exactly equal to the group means 
like they would have been in linear regression. Also, it is evident in the table that the 
predicted means often differ depending on the coding strategy used. For five of the 
seven categories, the dummy-coded model estimated the category mean most accurately 
among all models. 


Table 7: Predicted Category Scores for Different Coding Strategies by Lasso 


Dummy Contrast Sequential Helmert Training Mean Test Mean 


None 2.864 2.864 2.872 2.852 2.912 
6 years 2.897 2.888 2.896 2.883 2.824 
9 years 2.962 2.965 2.973 2.962 2.857 
12 years 2.997 2.995 2.997 2.997 3.038 
Some college 3.013 3.012 3.008 3.013 3.009 
College 2.990 2.990 2.991 2.990 2.999 
PhD/Doctorate 2.991 E| 2.991 2.991 2.991 3.008 


Note. Rows represent Education categories, and the middle four columns give the model 
predicted values with different coding strategies. The last two columns give the actual mean of 
each category observed in the training and test data, respectively. The closest value to the 
training mean is bolded and the closest value to the test mean has a grey background color in 
each row. 


The results of the predicted category scores for the four group lasso models, shown 
in Table 8, are very similar to the lasso models: Group lasso estimated each category 
score within a categorical variable differently depending on the coding strategy used. 
Thus, although variable selection is not impacted by the coding strategy used for group 
lasso, the predicted category score is impacted by the choice of coding strategy. Also, 
among all group lasso models, the dummy-coded model generated the most accurate 
category scores for four of the seven categories. Thus, regardless of whether lasso or 
group lasso was used, the dummy-coded model estimated the majority of the category 
means better than the other three models. It is unclear whether this finding would remain 
true with other data sets, however. 

The results in Table 7 and 8 show that different coding strategies result in different 
predicted category scores. While this is an important finding, it is equally important 
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Table 8: Predicted Category Means for Different Coding Strategies by Group Lasso 


Dummy Contrast Sequential Helmert Training Mean Test Mean 


None 2.828 2.863 BÆ 2.868 2.852 2.912 
6 years mss 2.892 2.906 2.894 2.883 2.824 
9 years 2.965 2.962 [ROS 2.963 2.962 2.857 
12 years Ha 2.996 2.994 2.996 2.997 3.038 
Some college 3.013 3.012 3.012 3.013 3.009 
College 2.990 2.990 2.990 2.990 2.999 


PhD/Doctorate Ea 2.991 2.991 2.990 2.991 3.008 


Note. Same as Table 7. 


to understand why this occurs and whether the degree of difference is predictable and 
understandable rather than random variability due to estimation. A core aspect of lasso 
and group lasso models is shrinkage: different coding strategies will result in different 
model intercepts and coefficients, because the degree of shrinkage is different across 
coding strategies. 

To visualize the shrinkage effect of each coding strategy, we plotted the predicted 
scores from each lasso model along with each model’s intercept in Figure 1. In the 
dummy-coded model, the predicted scores are all shrunk slightly toward the no edu- 
cation category score (since it is the intercept in this model) relative to the contrast- 
coded model, where the scores are instead all pulled closer to the grand mean (i.e., the 
model’s intercept). The predicted scores from the sequential-coded and Helmert-coded 
models, on the other hand, are shrunk closer towards each other more than those from 
the dummy-coded or contrast-coded models, reflecting the fact that shrinkage in se- 
quential coding and Helmert coding relies not on the intercept, but on the differences 
between neighboring categories or the average of multiple neighboring categories. For 
example, the 9 years and the some college categories are shrunk closer to the college 
category or Phd/Doctorate category in sequential-coded and Helmert-coded models. In 
summary, models fit with different coding strategies have different shrinkage patterns, 
and so predicted scores differ across these models, leading to different prediction accu- 
racy. These results suggest that one way to select a coding strategy is to consider the 
pattern of shrinkage which seems most reasonable. 


Model Fit Next, we recorded MSEs calculated from models including all six categor- 
ical variables and all seven continuous variables, to the test data set (Table 9). Model 
fit (MSE) differs by coding strategy for both lasso and group lasso. Contrast-coded 
models yielded the best MSE for both lasso and group lasso regression. This exposes 
uncertainty regarding which coding strategy should be used when lasso or group lasso 
regression is applied. While some differences in MSE are expected due to the stochas- 
tic nature of procedures like cross-validation used to choose the penalty parameter (A), 
it is notable that the MSEs were more variable for the group lasso models than they 
were for the lasso models, suggesting that choice of coding strategy could result in 
a much less optimal model (possibly worse than linear regression) when using group 
lasso. We explore this issue more in the Monte Carlo simulation. In Appendix A, we 
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Figure 1: Graphical Presentation of Category Means for Education Recreated by Lasso 
Models with Different Coding Strategies. Intercept values are different across cod- 
ing strategies. The intercept value is the estimated category mean for no education in 
dummy and sequential coding and the average of the category means in contrast and 
Helmert coding. 


demonstrate similar issues with the choice of reference group or category order across 
different coding strategies, and in Appendix B we demonstrate that the use of singular 
design matrices (e.g., including dummy codes for all categories) does not ameliorate 
this issue. 


Table 9: Model Fit (MSE) for Different Coding Strategies by Lasso and Group Lasso 


Regression 
Coding strategies Dummy Contrast Sequential Helmert 
Lasso Regression 0.13669 0.13660 0.13675 0.13677 
Group Lasso Regression 0.13711 0.13689 0.13691 0.13695 


Note. Rows represent different lasso methods, and columns represent models with different 
coding strategies. The lowest value (best prediction) in each row is bolded. 


2.3 Summary 


Choice of coding strategy has the potential to affect both variable selection and 
prediction accuracy in lasso regression models. As a result, depending on the coding 
strategy used, an analyst may end up with different variables included in their model, 
different predicted scores, and different prediction accuracy. With both the model’s vari- 
able selection and predictive performance dependent on how categorical predictors are 
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represented in the model, it is not a choice that should be taken lightly. Ideally, there 
would be a method which provides the same variable selection and the same predicted 
scores regardless of the coding strategy chosen. 

Group lasso partly addresses the issues caused by the choice of different coding 
strategies in lasso regression, because group lasso’s variable selection is not affected 
by the coding strategy used. Therefore, if researchers use group lasso to select which 
variables contribute to the outcome variable, they do not need to worry that different 
coding strategies may result in different conclusions. However, coding strategies still 
affect the prediction accuracy of group lasso models. Therefore, if researchers aim to 
predict the outcome variable by using group lasso regression, they need to be aware 
that different coding strategies can result in different prediction accuracy. In addition, 
because group lasso is selecting more variables into the model, the robustness of group 
lasso across coding strategies may come at the cost of prediction accuracy. Comparing 
the MSEs between the lasso models and group lasso models, the lasso models typically 
have lower MSE (i.e., better prediction accuracy) than group lasso. 

This trade-off between prediction accuracy and robustness leads to some additional 
concerns about the group lasso. There seems to be a trade-off between including a set 
of predictors in a model, as compared to when a specific predictor. For example, if 
the average stress for all levels of education was the same except for those with PhDs, 
would group lasso still select the education set of variables into the model? Will the set 
of indicators for the categorical variables be selected if there is only one category that 
differs from the other categories within that variable? If this group is selected into the 
model, this means that many additional parameters would also be included to capture an 
effect that is only attributable to one indicator variable. Alternatively, if the group is not 
selected, then the predictive ability of the group lasso model may suffer. This problem 
does not occur with lasso, as it is able to include a single indicator variable to represent 
one category differing from the rest. Next, we explore this specific case and examine if 
group lasso’s ability to include groups of variables leads to issues with overfitting. 


3 Monte Carlo Simulation 


In this section, we use a Monte Carlo simulation to explore a potential weakness of 
group lasso: overfitting. Group lasso may select more variables than necessary into the 
model, leading to larger variance and lower prediction accuracy. We explore a partic- 
ularly extreme data generation case, where across all categories within one categorical 
variable, only one category differs from the rest. We call this category the dominant 
category and refer to all others as non-predictive categories. A non-predictive category 
is always used as the reference category in the analysis. While the simulation is much 
simpler than cases that would occur in real data analysis, it provides a clear demonstra- 
tion of a pattern that is likely to occur and be problematic and hard to identify in more 
complex situations. 


Simulation Method The data was generated such that the dominant category had a 
nonzero category mean, while non-predictive categories all had category means of zero. 
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All categorical variables were encoded using dummy coding. A second predictor vari- 
able was generated to follow a standard normal distribution. The outcome variable was 
created by adding the category mean, the value of the continuous variable, and a random 
error term drawn from a standard normal distribution. For optimal prediction, both the 
continuous predictor and the indicator variable which estimates the difference between 
the dominant category and other non-predictive categories should be included in the 
model, while the variables associated with non-predictive categories should not. 

As previously mentioned, the number of categories within categorical predictors 
may affect how the coefficients are estimated and how the model selects predictors in 
group lasso. Therefore, we varied the number of non-predictive categories (2,3,4). To 
examine how the effect size would affect group lasso’s prediction accuracy and variable 
selection, we also simulated different dominant category means (0.1, 0.2, 0.3). For each 
combination of number of categories and effect size, we randomly generated 500 data 
sets with a sample size of 1200. 

For each data set, we first split the data set into training and test sets randomly based 
on an 8:2 ratio. Then we fit lasso and group lasso models with the same training data. 
We selected the penalty parameter using the same cross-validation methods used in 
previous sections. For each model, we calculated the MSE, whether the model included 
the dominant category, and whether the model included the non-predictive categories. 
We calculated the average prediction accuracy of each method as well as the proportion 
of models that included the dominant category and the proportion that included non- 
predictive categories across each condition. For group lasso, these two proportions were 
always the same because group lasso either includes or excludes all categories within 
the categorical predictor. 


Simulation Results We first found that in all conditions lasso had a higher prediction 
accuracy than group lasso, indicated by lower MSEs (Table 10). Though the differ- 
ences in MSE of lasso and group lasso were small, they were consistent across different 
conditions. Secondly, for both group lasso and lasso regression, when the number of 
non-predictive categories increased, the probability for models to include the dominant 
category decreased, but the probability for lasso was consistently greater than or equal 
to that for group lasso (Figure 3). This means that lasso is more likely to include the 
dominant category than group lasso across the number of non-predictive groups. Figure 
2 shows that when the number of non-predictive categories stayed the same, the prob- 
ability for group lasso to include non-predictive categories increased when the effect 
size increased, while the probability for lasso remained relatively flat. For both mod- 
els, the probability of including non-predictive categories decreased as the number of 
non-predictive categories increased. 

Returning to the potential issue of overfitting in group lasso, consider the case where 
the dominant group mean is large. Figure 2 shows that when the dominant group mean 
was 0.3, group lasso had a higher probability than lasso of including non-predictive 
categories. In this case, group lasso could overfit the data because group lasso was more 
likely to include categories that were not supposed to be in the model. This also explains 
group lasso’s lower prediction accuracy than lasso in Table 10 when the dominant group 
mean was large. 
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Table 10: Differences in MSE of Lasso and Group Lasso Models for Monte Carlo Sim- 
ulation 


Number of Non-predictive Categories 


Dominant Category Mean 2 3 4 
0.1 0.0028 0.0029 0.0003 
0.2 0.0020 0.004 0.0008 
0.3 0.0029 0.0029 0.0030 


Note. Values larger than zero mean that the MSE for group lasso is larger than the MSE for 
lasso. 
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Figure 2: Comparison of Probabilities of Including Non-Predictive Categories under 
Different Numbers of Categories for Lasso and Group Lasso Models 
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Figure 3: Comparison of Probabilities of Including the Dominant Category under Dif- 
ferent Dominant Category Means for Lasso and Group Lasso Models 
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Simulation Summary Using Monte Carlo simulation, we demonstrated conditions 
under which group lasso may be likely to have issues with overfitting. When one or just 
a few categories differ from the rest, lasso may be more efficient with better prediction 
accuracy than group lasso. In these cases, group lasso is likely to include the categorical 
variable, including all non-predictive categories. Therefore, if researchers use group 
lasso to build predictive models, they may want to examine if one or two categories 
have relatively dominant means within categorical variables in advance, or if this pattern 
is hypothesized to occur they might prefer lasso. Looking for these effects may be 
particularly difficult in cases with many predictors where limited theoretical knowledge 
are driving the modeling, which is often the case when lasso is used. The differences 
must be conditional on all other variables in the data, not just examining the group 
means. If there are many categorical predictors in the model, exploratory analyses could 
be undertaken for each categorical variable, but this could a be tedious undertaking. 
Overall, this simulation demonstrates that there may be situations in which group lasso 
is not optimal for handling categorical predictors, especially if prediction accuracy is a 
high priority. 


4 Discussion 


In this paper, we demonstrate that lasso and group lasso models are sensitive to de- 
cisions about coding strategy for categorical predictors (e.g., dummy or sequential) and 
the choice of reference group/order of the categories (Appendix A). Linear regression 
does not have this problem, as the model fit and predicted values do not vary depend- 
ing on the coding strategy. Group lasso presents a partial solution by having consistent 
variable selection across coding strategies. However, this consistency may come at a 
cost of reduced prediction accuracy. Ultimately, this leaves open the question of which 
coding strategy should be chosen. In the next section, we explore potential solutions to 
this issue with categorical predictors in lasso-based models. 


4.1 Exploring Potential Solutions 


Regardless of which of the following solutions researchers choose, one thing is 
always required: transparency. In searching the literature for examples of applications of 
lasso with categorical predictors, we found very few teams reported the coding strategy 
or order of categories used. Researchers using categorical variables in lasso or group 
lasso regression need to report how they coded the variables (both coding strategy and 
variable order/reference group) as this is imperative for reproducing or replicating their 
results. The following are a few proposed solutions, none of which seem satisfactory 
for all cases. As such, we weigh the pros and cons of each and consider cases when 
each approach might be most acceptable. 


Prioritize Interpretability In cases where one coding strategy provides better inter- 
pretability of the model coefficients than another strategy, the most interpretable coding 
strategy could be chosen. This comes at the risk of having a worse predictive model, 
since the idea of interpretability is still very much rooted in the origins of inferential 
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rather than predictive statistical models. In particular, because the coefficient estimates 
in lasso regression are biased, they should not be interpreted directly. Rather, after vari- 
able selection is completed, common recommendations are to fit a linear regression 
model that only includes the selected variables (Hastie, Robert, & Wainwright, 2015). 
It would be unusual to include a coding strategy in the follow-up linear regression that is 
different from the strategy used in the lasso regression. Thus, researchers should choose 
the coding strategy for each categorical variable that would be most interpretable if that 
variable was selected by a variable selection procedure to remain in the model. Coding 
schemes like Helmert coding require the presence of all predictors to have the intended 
interpretation, and should perhaps only be used in concert with group lasso (ensuring 
all predictors are selected in or out of the model) if interpretability is the top priority. 
Notably, machine learning approaches are often used in cases where there are many 
variables included in the analysis, and relatively little theory regarding which variables 
should be predicting the outcome. This could make it difficult for the researcher (or an- 
alyst) to decide which coding scheme would be “most interpretable,” especially consid- 
ering the many possible combinations of coding schemes and variable orders/reference 
groups. 


Prioritize Robust Variable Selection Based on the real data analysis and the simula- 
tion results, the group lasso is robust to coding strategy choices with respect to variable 
selection. Prediction accuracy is not necessarily optimized for the group lasso. How- 
ever, when the goal is to select variables, and especially when it is conceptually useful 
to keep or drop all indicators for each categorical variable, group lasso seems to be an 
optimal choice. Nevertheless, this may come at a cost of prediction accuracy, particu- 
larly if categorical variables follow the dominant group pattern explored in the Monte 
Carlo simulation above, where one group is distinct from all other groups. 


Prioritize Prediction Another option when estimating lasso or group lasso models 
would be to try many different coding strategies in order to select the one with the best 
overall prediction accuracy. This process should likely be completed using the train- 
ing data so it does not influence the final prediction accuracy estimate acquired using 
an independent sample of the data. This approach can be very computationally inten- 
sive. With multiple categorical variables in the data set, trying different combinations 
of coding strategies would result in maximized prediction accuracy. 


Notably, if prediction accuracy is of the highest priority, alternative machine learn- 
ing approaches typically have higher prediction accuracy than lasso approaches, and 
many are robust to coding strategy. Techniques like classification and regression trees 
(CART) are unaffected by coding strategy because categorical predictors are treated as 
a single variable (Finch & Schneider, 2007). Realistically, researchers may be balancing 
their comfort with advanced analytic methods and their priority of prediction accuracy. 
CART methods do not provide the ”regression-like” estimates which many behavioral 
scientists rely on for interpreting their results. 
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4.2 Future Directions 


There are several future directions we believe would be particularly beneficial for 
improving the state of research in the area of (group) lasso regression with categorical 
predictors. The first is the concept of intercept penalization. The typical practice within 
lasso is not to penalize the intercept (Wu & Lange, 2008), but the interpretation of the 
intercept varies greatly depending on which coding scheme is used. For example, when 
dummy coding is used, the intercept is the average of the reference group. Alternatively, 
when contrast coding is used, the intercept is the average of all groups. Ultimately, 
this means that different group means have differential penalization depending on the 
coding strategy used (as reflected in Figure 1). Thus, it is worth investigating whether 
penalizing the intercept may be appropriate in certain cases, and whether this would im- 
prove prediction accuracy (just as penalizing all other regression coefficients improves 
prediction accuracy in lasso). This question remains largely unexplored and would be 
informative to researchers who are interested in improving prediction accuracy. 

Current defaults in software suggest that the field norm for coding strategy is dummy 
coding. The current research has demonstrated that dummy coding is a potentially risky 
choice as a default, as the choice of reference group can greatly impact the model, and 
the shrinkage is toward a group mean. Alternatively, contrast coding may make it an 
appealing default for researchers unsure about which coding strategy to use. Because 
the interpretation of the intercept for contrast coding is the average across all groups, 
the penalization of the groups is symmetric about this average. This means that when 
a coefficient is dropped from the model, the group that is indicated by this predictor 
is assumed to be equal to the grand mean. This method contrasts with dummy coding 
where all estimated group means are shrunk towards the reference group score. As a 
result, the selection of the reference group in contrast coding has less of an impact on 
parameter estimates than it does in dummy coding, because by selecting a reference 
group in dummy coding, that group’s score is not at all penalized (if the intercept is 
not penalized). The interpretation of the intercept from contrast coding also aligns with 
how intercepts would be interpreted if there were no categorical variables in the model 
and all continuous predictors were standardized (i.e., sample average). Thus, contrast 
coding stands as a reasonable default if researchers are unsure of which coding strategy 
to choose; however, the use of contrast coding should be studied further in a variety of 
contexts to assess its appropriateness as a potential default. 

Another observation our team made during this investigation was that group size 
mattered quite a lot with respect to how much predicted group scores varied across dif- 
ferent coding strategies. In particular, in the COVID stress data, the no education group 
was particularly small (N = 77 out of 10,000 observations). This resulted in two prob- 
lems that merit further investigation. The first is how group size can impact estimates 
and interact with the selection of coding strategy/reference group. Previous research by 
Choi, Park, and Seo (2012) has already shown that variability in the number of groups 
that categorical predictors contain can influence whether lasso or group lasso produces 
better prediction accuracy and recovery of model coefficients. As can be seen in Figure 
1 and Table 7, the estimated means for the no education group in the COVID stress data 
were very unstable and varied more across coding strategies than any other group. Sim- 
ilarly, in Table 12 in Appendix A, we can see that the estimates of all of the Education 


Lasso with Categorical Predictors 35 


group means have the greatest bias when no education is used as the reference group. 
Future research should examine how variability in the sizes of those groups can impact 
the fitting of lasso and group lasso models 

A second issue brought up by having small groups is the difficulty of splitting test 
and training data sets. This may become particularly problematic when there are many 
categorical variables that include many groups. Previous researchers have resolved to 
combine groups that are particularly small (e.g., racial/ethnic minorities; Webb et al., 
2019). It is unclear how this practice impacts estimates for these groups, however, and in 
general combining groups is actively discouraged for other analytic methods (Tarantola 
& Dellaportas, 2005). Methods for splitting the data such as block randomization may 
provide more accurate predictions for small groups if the groups can be evenly split 
across the training and test sets. 


4.3 Conclusion 


Overall, our findings suggest that researchers should be cautious and purposeful 
about selecting their coding strategies when using lasso or group lasso. These choices 
will impact both variable selection and prediction accuracy when using lasso and pre- 
diction accuracy when using group lasso. However, just because variable selection is 
not impacted in group lasso does not mean this method should always be preferred. In a 
simulation study, we demonstrated cases where group lasso may have lower prediction 
accuracy than lasso, particularly when there is a dominant group (one group that dif- 
fers from all other groups). The choices of which method to use (lasso or group lasso), 
what coding strategy to use, and which group order/reference category to use should 
depend on the researcher’s priorities. How categorical variables are represented in lasso 
or group lasso models must be transparently reported to maximize reproducibility and 
replicability. Future research should explore specific practices in this area such as pe- 
nalization of the intercept, the use of contrast coding, and how small groups should be 
accounted for to optimize prediction accuracy for these groups. 

Behavioral scientists are quickly adopting useful tools developed in statistics and 
computer science which fit under the broad area of machine learning and artificial intel- 
ligence. The use of these tools will likely improve the ability of behavioral researchers 
to predict out-of-sample data, which may be particularly important in clinical settings 
and precision medicine. However, it is important to acknowledge that these new tools do 
not necessarily perform in the same ways that many researchers expect based on their 
training, which is primarily in linear regression and ANOVA frameworks (Aiken, West, 
& Millsap, 2008). Ensuring that the differences between these more traditional statis- 
tical frameworks and the newly developed machine learning frameworks are clearly 
defined will improve the implementation of these new methods throughout the field of 
behavioral science. 
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Appendix A Different reference categories 


In addition to the analyses presented in the primary manuscript, we also examined 
how variable selection and prediction accuracy in lasso and group lasso models differ 
across choices within a specific coding strategy. These choices include reference cate- 
gories (dummy and contrast coding) and the order of categories (sequential and Helmert 
coding). We tested whether the category chosen as the reference category in the dummy 
coding strategy matters for variable selection and prediction accuracy. Consider, for ex- 
ample, the dominant group case where all groups have the same mean except one group. 
If that one group is selected as the reference category, then all k — 1 predictors should be 
selected into the model, because all other groups are different from the reference. If any 
other group is selected as the reference group, then only 1 predictor should be selected 
into the model (the indicator for the difference between the one deviant group and the 
reference). While the pattern of means is not different, the reference group may have a 
large impact on the size of the coefficients and the number of non-zero coefficients. 

We fit lasso and group lasso models with all six dummy-coded categorical variables 
and seven continuous variables using the COVID stress data. To explore how choices of 
reference categories affect estimated coefficients, we fit seven models for each regres- 
sion method with differences only in their choices of reference categories in the variable 
Education. The reference categories were chosen and fixed for all other categorical vari- 
ables. Therefore, the differences between these models can only be attributed to the dif- 
ferent choices of the reference category of the variable Education. While this example 
uses dummy coding, we believe the results would generalize to other coding strategies 
(e.g., choice of the reference group for contrast coding, order of groups for Helmert and 
sequential coding). 


Appendix A.1 Variable Selection 


Table 11 shows the coefficients of indicators for Education. The size of the coeffi- 
cients varies depending on which group is the reference, which could pose a problem 
for lasso regression because coefficients and the penalty parameter decide whether the 
variable will be selected into the model, according to Equation 5. Different coefficients 
are not necessarily a problem by themselves; however, these results demonstrate cer- 
tain asymmetries that are concerning. When coefficients vary from model to model, 
the variable selection can differ. For example, when “none” was the reference category, 
the college category was not selected into the model (i.e., the none and college cate- 
gories are assumed to be equal). However, when “college” was chosen as the reference 
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category, the none category was selected into the model (i.e., the none and college cat- 
egories are treated differently). This marks a particularly concerning lack of symmetry 
between these lasso models. 


Table 11: Model Coefficients for Different Reference Categories by Lasso 
Reference Category 


Variables None 6years 9years 12 years Some college College PhD/Doctorate 
Intercept 2.637 2.574 2.649 2.668 2.657 2.641 2.649 
None : -0.013 -0.076 -0.092 -0.083 -0.068 -0.075 

6 years -0.068 ; -0.079 -0.095 -0.086 -0.071 -0.078 

9 years 0.010 0.065 : -0.006 0 0.009 0.002 

12 years 0.033 0.084 0.024 ; 0.017 0.032 0.024 
Some college 0.017 0.067 0.008 -0.006 ; 0.016 0.008 
College 0 0.049 -0.008 -0.024 -0.015 : -0.008 
PhD/Doctorate 0.005 0.057 0 -0.015 -0.006 0.005 


Note. Each column represents one model, and each row represents the coefficients for Education 
produced by each model.“.” is the reference category for the corresponding model, and 0 means 
that lasso does not select the corresponding predictor to be included in the model. 


Group lasso models included all categories within the variable Education when dif- 
ferent categories were chosen as the reference categories, meaning that all categories 
were treated as different in all group lasso models. Group lasso ensures stable perfor- 
mance of variable selection across reference categories. 

We also explored the effect of different reference categories in education on other 
predictors and found that choosing different reference categories affects the coefficients 
and variable selection of other predictors (categorical and continuous) in lasso models. 
Group lasso models, on the other hand, still performed consistent variable selection for 
predictors that did not have their reference categories changed. In our case, group lasso 
models always included all categories within the other five categorical predictors and 
all seven continuous predictors. 


Appendix A.2 Prediction Accuracy 


We examined the prediction accuracy from two aspects: predicted category scores 
and model fit, varying the reference group used in dummy coding education. 


Predicted Category Scores Predicted values for each category were different in both 
lasso and group lasso models from Tables 12 and 13. For the no education category, 
lasso models with different reference categories predicted different values, ranging from 
2.982 to 2.915. Group lasso models also predicted different values for the no education 
category, ranging from 2.983 to 2.991. This indicates that with different choices of 
reference categories, predicted values vary from model to model for both lasso and 
group lasso. 
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Table 12: Predicted Category Means and Prediction Accuracy for Different Reference 
Categories by Lasso 


Reference Category 


Category None 6years 9years 12 years Some college College PhD/Doctorate 
None 2.982 2.920 2.915 2.915 2.915 2.915 2.915 

6 years 2.914 2.933 2.912 2.912 2.912 2.912 2.912 

9 years 2.992 2.998 2.990 3.001 2.998 2.992 2.992 

12 years 3.015 3.017 3.014 3.006 3.014 3.014 3.014 
Some college 2.999 3.001 2.998 3.000 2.998 2.998 2.998 
College 2.982 2.983 2.982 2.982 2.982 2.983 2.982 
PhD/Doctorate 2.988 2.990 2.990 2.991 2.991 2.987 2.990 
MSE 0.13669 0.13678 0.13674 0.13684 0.13675 0.13674 0.13674 


Note. Each column represents one model, and each row (besides the last) represents the 
predicted category means for Education produced by each model (with all continuous predictors 
set to their means and all other categorical variables set to their modes). The last row contains 
the MSE of the corresponding model. 


Table 13: Predicted Category Means and Prediction Accuracy for Different Reference 
Categories by Group Lasso 


Reference Category 


Category None 6years 9years 12 years Some college College PhD/Doctorate 
None 2.986 2.986 2.986 2.990 2.991 2.983 2.987 

6 years 2.971 2.978 2.974 2.983 2.981 2.971 2.975 

9 years 2.988 2.987 2.968 2.979 2.975 2.965 2.969 

12 years 3.002 3.001 3.004 2.991 2.992 2.985 2.988 
Some college 2.996 2.996 2.997 2.996 3.004 3.003 3.004 
College 2.983 2.983 2.983 2.983 2.983 2.996 2.997 
PhD/Doctorate 2.988 2.988 2.988 2.990 2.990 2.987 2.983 
MSE 0.13711 0.13719 0.13709 0.13727 0.13709 0.13708 0.13710 


Note. Same as Table 12 


Figure 4 visualizes the shrinkage effect when different reference categories were 
chosen in lasso models using Education to predict Stress. In this case, the intercept is 
the predicted category mean of each model’s reference category because models are 
coded by dummy coding strategies. Similar to Figure 1, we can conclude that recreated 
category scores shrink towards the reference value for dummy coding. 


Model Fit Model fit, measured by MSE, for both lasso and group lasso models are 
shown in Table 12 and 13. MSEs were generally different across reference categories. 
Note that MSEs in Table 12 and 13 were rounded. Although some MSEs were very 
close to each other and were rounded to the same value, they were not exactly the same, 
which would be the case if linear regression was used. 
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Figure 4: Graphical Presentation of Category Means for Education Recreated by Lasso 
Models with Different Reference Categories. Intercept values are different across refer- 
ence categories. In dummy coding, the intercept value is the estimated category mean 
of the corresponding reference category. 


Appendix B Singular Design Matrices 


STATA is a commonly used statistical software that can implement lasso regression, 
and in STATA categorical predictors are handled by including a singular design matrix 
StataCorp (2019). In this section, we examine this alternative method for creating the 
design matrices for categorical variables. When we introduced categorical variables, we 
noted that for a variable with k categories, k — | indicators are created for this variable. 
Different coding strategies use different matrices to represent the k — | indicators and 
model coefficients represent differences between categories and the reference value, as 
this is common practice for linear regression. The researcher must then choose the ref- 
erence category for analysis. However, there is another way to create the design matrix 
for categorical predictors where the researcher does not need to explicitly choose the 
reference category. Instead of using k — 1 indicators for a categorical variable with k 
categories, we use k indicators. This design matrix allows lasso or group lasso to essen- 
tially select the reference values. Mathematically, this type of design matrix is defined 
as singular, because the matrix is not invertible. Singular design matrices cannot be 
used for linear regression, but lasso and group lasso regression can accommodate sin- 
gular design matrices, making this a unique potential solution to the variable selection 
and prediction accuracy issue related to categorical variables in lasso and group lasso. 

Can singular design matrices solve the inconsistency in lasso’s variable selection 
and prediction accuracy or group lasso’s prediction accuracy across coding strategies? 
To create singular design matrices, we appended a linearly independent column with 
only 1 in the first row to the matrices in Table 1 and 3, and a linearly independent col- 
umn with only | in the last row to matrices in Table 2 and 4. If using singular design 
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matrices solves the issues of variable selection and prediction accuracy, these two prop- 
erties should be equivalent across these four design matrices. To test this, we used the 
same data set and applied the same process as before to fit lasso and group lasso models 
with Education serving as the only predictor variable. Table 14 shows the coefficients 
of the categorical variable Education in lasso models as an example of lasso’s vari- 
able selection. Using a singular design matrix for categorical variables, different coding 
strategies still lead to different lasso model’s variable selection. Contrastingly, group 
lasso selected all categories and performed the same variable selection. For example, 
the contrast-coded lasso model treated the 9 years of education and PhD categories 
as the same, while these two categories were always treated as different in the other 
three lasso models and the four group lasso models. In addition, lasso and group lasso 
models using different coding strategies led to different prediction accuracies, shown 
in Table 15 and 16. This means that using singular design matrices does not solve the 
inconsistent variable selection or prediction accuracy for lasso, nor does it solve the 
inconsistency in prediction accuracy for group lasso. There are infinitely many singular 
design matrices that could be used, and if they all result in different solutions, this does 
not provide strong evidence that the identity matrix system used by StataCorp (2019) 
would perform optimally. 


Table 14: Model Coefficients Using Singular Design Matrix with Lasso 


Coding strategies Dummy Contrast Sequential Helmert 
Intercept 2.991 2.961 2.873 2.961 
1. no -0.109 -0.081 0.015 0.103 
2. 6 years -0.086 -0.063 0.076 0.095 
3. 9 years -0.006 0 0.034 0.023 
4. 12 years 0 0.034 0.010 0 

5. some college 0.016 0.051 -0.017 -0.017 
6. college degree 0 0.028 0 0 

7. PhD 0 0 -0.009 0 


Note. Each column represents one model, and each row represents the coefficient for an 
indicator of Education produced by the corresponding model. A 0 means that lasso does not 
select the corresponding category into the model. 
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Table 15: Predicted Category Means and Prediction Accuracy for Different Coding 
Strategies using Singular Design Matrices with Lasso 


Coding Strategy 

Category Dummy Contrast Sequential Helmert Observed Mean 
None 2.882 2.881 2.864 2.873 2.852 

6 years 2.905 2.898 2.888 2.897 2.883 

9 years 2.986 2.961 2.964 2.973 2.962 

12 years 2.991 2.995 2.999 2.997 2.997 
Some college 3.007 3.012 3.008 3.008 3.013 
College 2.991 2.990 2.991 2.991 2.990 
PhD/Doctorate 2.991 2.992 2.991 2.991 2.991 
MSE 0.15630 0.15620 0.15616 0.15621 / 


Note. Each column (besides the last) represents one model, and each row (besides the last) 
represents the predicted category means for Education produced by each model. The last 
column contains the category means observed in the training data set. The last row contains the 
MSE of the corresponding model. 


Table 16: Predicted Category Means and Prediction Accuracy for Different Coding 
Strategies Using Singular Design Matrices with Group Lasso 


Coding Strategy 

Category Dummy Contrast Sequential Helmert Observed Mean 
None 2.873 2.869 2.878 2.871 2.852 

6 years 2.892 2.890 2.909 2.891 2.883 

9 years 2.962 2.961 2.955 2.962 2.962 

12 years 2.996 2.996 2.994 2.996 2.997 
Some college 3.012 3.012 3.010 3.012 3.013 
College 2.990 2.990 2.991 2.990 2.990 
PhD/Doctorate 2.990 2.991 2.991 2.990 2.991 
MSE 0.15619 0.15619 0.15622 0.15619 / 


Note. Same as Table 15 
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Abstract. Latent growth curve models (LGCM) are widely used in lon- 
gitudinal data analysis, and robust methods can be used to model error 
distributions for non-normal data. This tutorial introduces how to model 
linear, non-linear, and quadratic growth curve models under the Bayesian 
framework and uses examples to illustrate how to model errors using t, 
exponential power, and skew-normal distributions. The code of JAGS 
models is provided and implemented by the R package runjags. Model 
diagnostics and comparisons are briefly discussed. 
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1 Introduction 


Latent growth curve models (LGCM) are widely used in longitudinal studies, 
and LGCM performs well in the identification of intraindividual changes and 
investigation of interindividual differences in intraindividual changes (McArdle 
& Nesselroade, 2014). LGCM can estimate linear and nonlinear growth trajecto- 
ries flexibly or freely estimate the shape of growth trajectory by observed data. 
Researchers may employ either the maximum likelihood estimation method or 
the Bayesian method to model LGSM. The Bayesian methods have advantages 
on handling difficulties in longitudinal data such as unequally spaced measure- 
ments, nonlinear trajectories, non-normally distributed data, and small sample 
sizes (Curran, Obeidat, & Losardo, 2010). 

Influential outliers and non-normally distributed data can lead unreliable 
estimates and inferences. Conventional methods such as deleting outliers may 
result in underestimated standard errors (Lange, Little, & Taylor, 1989). Robust 
statistical modeling methods have been developed to handle the violation of the 
normality assumption. For example, the t-distribution is more robust to outliers, 
and using t-distribution to model errors is one of the robust modeling strategies 
(Lange et al., 1989). Robust modeling using ¢-distributions is easy to understand 
and applied in both maximum likelihood and Bayesian methods (Lange et al., 
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1989; Zhang, 2016; Zhang, Lai, Lu, & Tong, 2013). The degree of freedom of t- 
distributions can be estimated or predetermined, and a large degree of freedom 
means the ¢-distribution approaches a normal distribution (Zhang et al., 2013). 
Based on simulation studies, the robust method using the ¢-distributions for the 
error term demonstrates good performance for heavy-tailed data in growth curve 
models, and it efficiently estimates the standard error (Zhang, 2016; Zhang et 
al., 2013). 

This tutorial aims to present how to implement robust Bayesian growth curve 
models using R and the JAGS programs. To begin, it provides a brief introduction 
to LGCM, including the latent basis growth curve models (LBGM), the linear 
growth curve models, and quadratic growth curve models. Then it introduces 
commonly used priors and convergence diagnostic methods. Finally, a real data 
set is used to demonstrate how to implement robust LGCM, and how to interpret 
the estimated parameters. 


2 Models and notations 


2.1 General latent growth curve models 


Latent basis growth curve models A LGCM with one variable Y can be 
written as: 


Y; =T+ Ab; +E; (1) 


Y; is a T x1 vector in which T is the total number of measurement occasions, 
and A is a T x q factor loading matrix, and it decides the shape of the growth 
trajectory. The €; is assumed to follow a q-variate normal distribution €; ~ 
MN(0, ®). b; is an q x 1 vector and it represents the latent variables used to 
describe the change. 3 is a q x 1 vector that represents the fixed effect (the 
means of b;) and e; is the individual deviation from the fixed effect 3. u; follows 
a multivariate normal distribution with q dimensions as u; ~ MN(0, W). 

LBGM is a special case of the general LGCM. It assumes the error variance 
is the same for all measurements (homogeneity) by simplifying @ = Io?. And 
it also assumes measurement errors are uncorrelated. The parameters in LBGM 
are: 


1 0 
1 1 
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LBGM contains two latent variables: by and bg. br represents the intercept 
and bs represents the growth slope. The specification of factor loadings of bg 
determines the shape of the growth curve. Here, the first and second-factor 
loadings on bg are fixed at 0 and 1 for identification purposes, while other factor 
loadings are freely estimated. This assumption implies that the growth unit is 
the difference between the first two measurements. Another common practice is 
to fix the first and last factor loadings at 0 and 1, respectively, with the unit 
representing the difference between the first and last measurements. 8; and s 
represent the average intercept and slope across all individuals, respectively. o? 
and oe represent variances, reflecting the individual differences in intercept and 
slope. ozs represents the covariance between the intercept and slope. 


The linear growth curve model The specification of A decides the shape 
of growth. When the factor loadings of bs are equally spaced, it becomes a 
linear growth curve model. A linear growth curve model assumes a linear change 
pattern and the slope bg represents a linear slope. The factor loading matrix is: 


1 0 
1 1 
esd. 2 
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The quadratic growth curve model The quadratic growth curve model 
estimates a nonlinear change by including the quadratic slope big, and the model 
can be presented as: 


1 0 0 
1 1 1 
bit 
2 
A=|1 2 2 bi = | bis |, 
> å : big 
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2.2 Robust growth curve models 


The general LGCM assumes €; follow a multivariate normal distribution (€; ~ 
MN(0,®)), while robust growth curve models use other distributions to for €;. 
Zhang (2016) presented and summarized how to use Student’s t, exponential 
power, and the skew normal distributions to build robust LGCM. 
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Student’s t-distribution The robust growth curve models can be specified 
by modeling €; by a student’s t-distribution: €; ~ MT7(0,®,k), where k is the 
degrees of freedom. The robust LGCM with a Students’ t-distribution performs 
better than the traditional growth curve model with a multivariate normal dis- 
tribution when dealing with heavy-tailed data and outliers (Zhang, 2016; Zhang 
et al., 2013). 

The multivariate ¢-distribution approaches the multivariate normal distribu- 
tion when k increases. In the robust Bayesian methods, k can be specified as an 
unknown parameter, and a prior is needed to estimate k. Alternatively, it can 
be fixed and some researchers suggested k = 5 (Zhang et al., 2013). 

In JAGS, t-distribution can be specified using the function dt (), and this 
function will be explained in the following section with an example. 


Exponential power distribution The exponential power distribution can 
model error term e; with smaller kurtosis than normal distributions, and we 
employ the same form of density function and parameters as Zhang (2016) in 
this tutorial. The density of exponential power distribution is as follows: 


Pep(&) = w(y)o exp [enie] 


where 
so) = LBU +92)? 
+y Ea +2 
and 
ay E ETE 
Ti +7)/2] 


Here u and o are location and scale parameters, respectively, and y is a shape 
parameter that can be estimated. 


Skew normal distribution Both the t-distribution and the exponential power 
distribution are symmetric, while the skew normal distribution offers an option 
to model asymmetric errors. The density function of a skew normal distribution 


is as follows: 
2 z— pu GB p 
w w w 


where u is a location parameter, w is a scale parameter, and a is a shape pa- 
rameter which can be estimated. 


3 Robust growth curve model using JAGS 


The following part introduces how to build and interpret the robust LGCM in 
JAGS using a real data set, assuming homogeneity across time points. 
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3.1 Specification of priors 


Priors of LGCM are usually specified as: 8 ~ N (uo, 02), B ~ W(V,m), o? ~ 
IG(a,ß) (assuming ® = Irxro?2). For the robust growth curve model with the 
t-distribution, the degrees of freedom k is another unknown parameter, and an 
uninformative prior is applied to k as follows: k ~ U(1,500). In the case of the 
exponential power distribution which involves an additional shape parameter y, 
an uninformative prior is assigned to it as follows: y ~ U(—1,1). Similarly, for 
the shape parameter a in the skew normal distribution, the prior is specified as 
a ~ U(—5,5). 


3.2 Convergence diagnostic 


To check convergence, trace plots are visually inspected. If trace plots indicate 
non-convergence, then more iterations and longer burn-in periods are needed. 
The length of the chain should be extended until trace plots of all parameters 
demonstrate visual convergence. 

In addition to visual inspection, various convergence diagnostic tools are 
available in R, including the Geweke test (Geweke, 1992), the Heidelberger and 
Welch test (Heidelberger & Welch, 1983), Gelman and Rubin test (Gelman & 
Rubin, 1992), and the Raftery and Lewis diagnostic (Raftery, Lewis, et al., 1992). 
In this tutorial, the Geweke diagnostic is used, which compares the mean differ- 
ence between two parts of chains, typically the first and last parts. It employs a 
z test to compare the means of two parts, and if the z test statistic rejects the 
null hypothesis, it indicates a significant difference. 


3.3 Autocorrelation and posterior distribution 


The adjacent iterations of the Markov chain may exhibit high dependence, and 
serious autocorrelation can indicate problems in model estimation such as a prob- 
lem with the sampling algorithm. The autocorrelation problem can be identified 
by visual inspections. If visual inspection shows high autocorrelation, increasing 
the number of iterations or implementing thinning techniques can be beneficial. 
Additionally, it is important to ensure that the posterior distribution makes 
substantive sense, taking into account factors such as the parameter’s range 
and standard deviation. For instance, it would be unreasonable if the posterior 
standard deviation exceeds the parameter’s scale. 


4 Examples 


This section includes R code and JAGS commands for constructing robust growth 
curve models. The ¢-distribution is offered by JAGS and can be directly imple- 
mented. In the following parts, t-distribution is utilized to model and compare 
LBGC, linear and quadratic LGCM. To illustrate different robust methods, we 
specify linear LGC models using the t, exponential power and skew-normal dis- 
tributions. 
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The data used in this tutorial were obtained from the Early Childhood Lon- 
gitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011), a national lon- 
gitudinal program conducted by the National Center for Education Statistics. 
ECLS-K:2011 collected information about children’s development during their 
elementary school years. For this tutorial, a random subset of data consist- 
ing of N = 200 samples was selected from ECLS-K:2011. This subset includes 
math scores measured at four different occasions. Math ability assessments were 
conducted annually, spanning from the second grade to the fifth grade. De- 
tailed information about ECLS-K:2011 can be found in the manual provided by 
Tourangeau et al. (2015). 

Descriptive analysis revealed that the distributions of the observed math 
scores were skewed and exhibited heavier tails than normal distributions, as de- 
picted in Figure 1. Additionally, increasing trends in math scores were observed, 
and the growth pattern of each individual is illustrated in Figure 2. 
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Figure 1. Descriptive plots of math scores 


4.1 Specify the JAGS models 


t distribution The LBGM model is specified using the JAGS notations as: 


# models 


math 


2- 
0- 
—2- 
math1 math2 math3 
time 
Figure 2. Growth curves of math scores in four waves 
modell <- "model { 
# Specify the likelihood 
for (i in l:nsubj) { 
for (j in l:ntime) { 
t error 
yi, 3] dt (mu[i, j], tauy, df) 
normal 
yi, 3] dnorm(mu[i, j],tauy) 
} 
} 
for (i in l:nsubj) { 
mu[i,1] <- b[i,1] 
mu[i,2] <- b[i,1]+b[i,2] 
mu[i,3] <- b[i,1]+A3*b[i,2] 
mu[i,4] <- b[i,1]+A4*b[i,2] 
b[i,1:2] ~ dmnorm(mub[1:2], taub[1:2,1:2]) 


} 
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# Specify the growth trajectory 
A3~dnorm(0,1.0E-6) 
A4~dnorm(0,1.0E-6) 
# specify priors 


mub[1]~dnorm(0,1.0 
mub[2]~dnorm(0,1.0 


Ei A 


math4 
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taub[1:2, 1:2] ~ dwish(Omega[1:2, 1:2], 2) 
sigma2b[1:2, 1:2] <- inverse(taub[1:2, 1:2]) 
tauy ~ dgamma(0.001,0.001) 

sigma2y <- 1 / tauy 

df ~ dunif (1,500) 


Omega[1,1] <- 1 
Omega[2,2] <- 1 
Omega[1,2] <- Omega[2,1] 
Omega [2,1] <- 0 


An R object mode11 is constructed using the model block. In the case of a 
four-wave of data organized with 200 rows (N = 200) and 4 columns (T = 4), 
we use for loops to specify the likelihood for all participants across the four 
measurement occasions. 

This likelihood reflects the use of a robust Bayesian method. Specifically, 
y[i, j] is modeled using univariate t-distributions, which are defined by dt () 
with parameters for means mu[i, j], precision tauy, and degrees of free- 
dom df. If the data were modeled using a multivariate normal distribution, 
i.e., y[i,j] ~ dnorm(mu[i,j], tauy), the model would represent a tra- 
ditional latent growth curve model with normal assumptions. 

The next part of the model involves specifying the prior distributions. @ is as- 
sumed to follow a bivariate normally distribution with 8 ~ MN((0,0)", 1000J2), 
and the covariance of e; follows an inverse Wishart distribution (Zhang, 2021). 
The error term e€; is assumed to follow a t-distribution with an estimated k 
(ei ~ MTr(0,®,k)). Here, a uniform distribution Uni f (1,500) is used as the 
prior of k. 

The latent variables and means are specified based on the hypothesized 
growth curve and priors. The parameter b[i,1] represents the latent inter- 
cept of LGCM, and the latent slope is b[i, 2]. A3 and A4 are factor loadings of 
math scores at the third and fourth measurement occasions, which control the 
shape of changes. 

If A3 is set to 2 and A4 is set to 3, then the model becomes a linear growth 
curve model. Quadratic LGCM involves three latent variables, within b[i, 3] 
representing the quadratic shape. The coefficients A5 and A6 are fixed at 4 and 
9, respectively. 

Detailed JAGS models for both the linear and quadratic growth curve models 
can be found in the appendix. 


Exponential power distribution JAGS does not offer exponential power dis- 
tribution or the skew-normal distribution by default. However, the likelihood 
can be specified indirectly using the Bernoulli or the Poisson distributions (Nt- 
zoufras, 2011). 

One approach, known as the “zero trick,” utilizes the Poisson distribution. 
A matrix with the same dimensions of the data is created, with all elements 


‘ 


Robust Bayesian growth curve modeling 51 


set to zero. The likelihood is reflected in the mean of the Poisson distribution. 
Assuming observation y; follows a new distribution and the log-likelihood is 
l; = log f (y;|@). The model likelihood can be expressed as: 


n —(-litc) —l; + 0)? 2 
folo) =T] > TU = | [f0 -4 +). 
f i=1 


i=l 


In this expression, the mean of the Poisson distribution is a constant (C) minus 
the log-likelihood (C — l;) and C is chosen to ensure the mean of the Poisson 
distribution is always positive. 

The one trick sets all observations to one and uses the parameter of the 
Bernoulli distribution to specify likelihood. 

In this paper, the zero tricks were used to specify exponential power and 
skew-normal distributions, assuming a linear change trajectory. 

The model code is provided in the appedix. In the code, the log gamma 
function is specified using command loggam(), and dpois () is used to sample 
from the Poisson distribution. 


Skew normal distribution The location parameter of the skew normal dis- 
tribution is reparameterized as 


a 
u = w— y (2/4 
(1 +a?) Ci 
to ensure that the mean of the error is zero. In the code, the standard normal 
cumulative density function is specified by phi () and the log density function 
of the normal distribution is specified by Logdensity.norm(). 


4.2 Specify iterations,initial values, and saved parameters 


After configuring the models, we can proceed by organizing the data in a list, 
specifying initial values, and running the JAGS model. 

The data is organized in a wide format and stored in a list called datalist, 
which includes the number of participants (nsubj) and the number of mea- 
surements (ntime). In this setup, we use two chains (nChains = 2), each 
with a length of 20,000 iterations (nIter = 20000), and a burn-in period of 
10,000 iterations (burnInSteps = 10000). Monitored parameters encompass 
the means and variances of intercepts and slopes, and the shape parameters such 
as the degrees of freedom. These parameters’ posterior draws will be saved. 


# create data set for \texttt{JAGS} model 


nsubj = nrow(data) 
ntime = ncol (data) 
datalist = list (nsubj=nsubj, ntime=ntime, y=data) 


# set parameters, adaption, and MCMC chains 
parameters = c("mub","sigma2b","sigma2y","df","A3", "A4") 
adaptSteps =5000 # Adaptive period 
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burniInSteps = 10000 # Burn-in period 
nChains = 2 # The number of chains 
niter =20000 # The number of kept iterations 


Two chains are used in this tutorial, and two sets of initial values are specified: 


# specify initial values 

inits <- list (list (mub=c(0.7,0.4), 
taub=structure(.Data=c(1,0,0,10), 
.Dim=c(2,2)), 


tauy=10,df=3), 

list (mub=c (0.8,0.5), 
taub=structure(.Data=c(2,0,0,8), 
.Dim=c(2,2)), 


tauy=15,df=5) ) 


4.3 Run JAGS models 


The package runjags is used in this tutorial and the function run. jags () is 
used to read, compile, and run the model, and the model results are saved for 
later analysis. 


# run JAGS model 

set.seed (1234) 

out <- run. jags (model=model, 
monitor=parameters, 
data=datalist, n.chains=2, 
inits=inits, method="simple", 
adapt=adaptSteps, 
burnin = burniInSteps, 
sample=nIter, 
keep.jags.files=TRU 
tempdir=TRUE) 


GI 
` 


4.4 Convergence diagnostic 


For convergence checking, we examine both trace plots and Geweke’s test. A 
visual inspection of the trace plots reveals that all parameters have converged 
after the adaptation and burn-in period. Figure 3 displays the plots of the latent 
intercept and slope in the LBGM. 

If Geweke’s test values exceeded 2, we doubled the number of iterations 
and reran the model. In this particular example, we found no clear evidence of 
non-convergence, however, some models exhibited autocorrelation issues in the 
slope, as shown in the autocorrelation plots. To address this, longer iterations 
or thinning techniques may be employed. 

Additionally, posteriors make practical sense by checking the shape and range 
in the posteriors plots. For example, the range of possible values for math ability 
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Figure 3. Trace, ECDF, posterior and autocorrelation plots of the intercept and the 
slope in LBGM with a t distribution 
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is from -4.0 to 4.0, and the posterior mean of the intercept was close to the mean 
of the observed math score in the second grade. 


# Geweke diagnostic 
geweke. diag (outSmemc) 

# Trace plots and autocorrelation plots 
plot (out) 


4.5 Model comparison 


Results of LBGM, the linear and quadratic LGC models are summarized in 
Table 1. To compare these models, we used the deviance information criterion 
(DIC). When the t distribution was employed, the quadratic LGCM exhibited 
the lowest DIC. 


Table 1. Results for the LBGM, linear and quadratic LGC models with t distribution 


LBGM Linear LGCM Quadratic LGCM 

Mean L U Mean L U Mean L U 
bit 0.73 0.62 0.83 0.76 0.65 0.85 0.74 0.64 0.85 
bis 0.46 0.42 0.51 0.37 0.34 0.39 0.42 0.35 0.48 
bi -0.02 -0.04 0.01 
o? 0.49 0.38 0.59 0.48 0.38 0.59 0.54 0.42 0.66 
OLS -0.05 -0.07 -0.02 -0.04 -0.06 -0.02 -0.11 -0.18 -0.05 
TLQ 0.02 0.00 0.04 
OLS -0.05 -0.07 -0.02 -0.04 -0.06 -0.02 -0.11 -0.18 -0.05 
oe 0.03 0.02 0.03 0.02 0.02 0.03 0.09 0.05 0.13 


TSQ -0.02 -0.04 -0.01 
TLQ 0.02 0.00 0.04 
TSQ -0.02 -0.04 -0.01 
fay 0.01 0.01 0.02 


o? 0.03 0.02 0.04 0.03 0.02 0.04 0.03 0.02 0.04 
k 3.44 2.30 4.76 3.53 2.31 4.94 3.58 2.04 5.42 


A3 1.64 1.51 1.77 2.00 2.00 
A4 2.44 2.26 2.64 3.00 3.00 
A5 4.00 
A6 9.00 
DIC 370.79 374.78 283.36 


Note. k represents the degrees of freedom. L: 2.5% HPD; U: 97.5% HPD. 


The estimated means of the intercept biz from the three models were close. 
The estimated factor loadings in LBGM were 1.64 and 2.44 in LBGM, which 
suggests the estimated growth shape was different from a linear trend. 

The estimated degrees of freedom were smaller than 5 in the three models. 
This aligns with the observation that the observed data had heavier tails than 
the normal distribution, as shown in Figure 1(Tong & Zhang, 2017). Therefore, 
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the estimated degrees of freedom (k) are consistent with descriptive statistics, 
affirming that the robust growth curve models are suitable for handling this 
dataset. 

When dealing with models that use exponential power and skew normal dis- 
tributions, it’s important to interpret the DIC (deviance information criterion) 
values from JAGS with caution. In these models, the DIC is calculated separately 
based on likelihood and posteriors. The deviance, denoted as D(6; y), is defined 
as —2log(p(x|9)). The effective model parameters is defined as pp = D — D, 
and the DIC is calculated as DIC = D+ pp. The model using the skew normal 
distribution exhibited the lowest DIC value, making it the preferred choice over 
the t and exponential power distributions. 


Table 2. Results for linear models with t, exponential power, and skew normal distri- 
butions 


Mean 2.5% HPD 97.5% HPD DIC 
t-distribution 


bir (0.76 0.65 0.85 
bis 0.37 0.34 0.39 

oz 0.48 0.38 0.59 
ors -0.04 -0.06 -0.02 374.78 
ors -0.04 -0.06 -0.02 

oz 0.02 0.02 0.03 

o2 0.03 0.02 0.04 

k 3.53 2.31 4.94 

Exponential power distribution 

bir 0.76 0.66 0.86 

bis 0.37 0.34 0.4 

o? 0.49 0.39 0.6 
ors -0.04 -0.06 -0.02 392.10 
ors -0.04 -0.06 -0.02 

c% 0.02 0.02 0.03 

o 0.07 0.06 0.08 

y 0.91 0.76 1 

Skew normal distribution 

bi 0.74 0.64 0.83 

bis 0.37 0.35 0.4 

o? 0.46 0.36 0.56 
ors -0.04 -0.06 -0.02 360.41 
ors -0.04 -0.06 -0.02 

o% 0.02 0.02 0.03 

oa 0.18 0.15 0.21 


a -4.17 -5 -3.01 


56 R. Li 


4.6 Summary of posteriors 


The posterior means of most parameters were almost the same for linear models 
using t, exponential power, and skew-normal distributions, see Table 2. For the 
linear LGCM with the exponential power error, the estimated shape parameter y 
was 0.91, which suggested a fatter tail of the errors than the normal distribution. 
The estimated a in the model using the skew-normal distribution was -4.17 which 
indicates the distribution was left-skewed. 


5 Summary 


LGCM is widely used in longitudinal studies, and the Bayesian approach can 
be applied to handle complex conditions. Bayesian approaches can handle the 
conditions that data are not normally distributed or the sample size is small. 
The robust Bayesian method offers an operable solution for data with heavy 
tails or outliers. 

This tutorial introduces how to implement robust LGCM with three distri- 
butions in JAGS and R in steps. It also covers the model diagnostics and com- 
parison, and interpretations of posterior estimations. This tutorial offers some 
guidelines for researchers who are interested in robust Bayesian growth curve 
models. 


References 


Curran, P. J., Obeidat, K., & Losardo, D. (2010). Twelve frequently asked ques- 
tions about growth curve modeling. Journal of cognition and development, 
11(2), 121-136. doi: https://doi.org/10.1080/15248371003699969 

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simu- 
lation using multiple sequences. Statistical science, 457-472. doi: 
https://doi.org/10.1214/ss/1177011136 

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to 
the calculations of posterior moments. Bayesian statistics, 4, 641—649. 

Heidelberger, P., & Welch, P. D. (1983). Simulation run length control in the 
presence of an initial transient. Operations Research, 31(6), 1109-1144. 
doi: https://doi.org/10.1287/opre.31.6.1109 

Lange, K. L., Little, R. J., & Taylor, J. M. (1989). Robust statistical modeling 
using the t distribution. Journal of the American Statistical Association, 
84 (408), 881-896. doi: https: //doi.org/10.2307/2290063 

McArdle, J. J., & Nesselroade, J. R. (2014). Longitudinal data analysis us- 
ing structural equation models. American Psychological Association. doi: 
https: //doi.org/10.1037/14440-000 

Ntzoufras, I. (2011). Bayesian modeling using winbugs. John Wiley & Sons. 
doi: https: //doi.org/10.1080/09332480.2012.685377 

Raftery, A. E., Lewis, S., et al. (1992). How many iterations in the gibbs sampler. 
Bayesian statistics, 4(2), 763-773. 


Robust Bayesian growth curve modeling 57 


Tong, X., & Zhang, Z. (2017). Outlying observation diagnostics in growth 
curve modeling. Multivariate Behavioral Research, 52 (6), 768-788. doi: 
https://doi.org/10.1080/00273171.2017.1374824 

Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., Hagedorn, M. C., Daly, P., 
& Najarian, M. (2015). Early childhood longitudinal study, kindergarten 
class of 2010-11 (ecls-k: 2011). user’s manual for the ecls-k: 2011 kinder- 
garten data file and electronic codebook, public version. nces 2015-074. 
National Center for Education Statistics. 

Zhang, Z. (2016). Modeling error distributions of growth curve models 
through bayesian methods. Behavior research methods, 48, 427—444. doi: 
https: //doi.org/10.3758/s13428-015-0589-9 

Zhang, Z. (2021). A note on wishart and inverse wishart priors for covari- 
ance matrix. Journal of Behavioral Data Science, 1(2), 119-126. doi: 
https://doi.org/10.35566/jbds/v1n2/p2 

Zhang, Z., Lai, K., Lu, Z., & Tong, X. (2013). Bayesian inference and applica- 
tion of robust growth curve models using student’s t distribution. Struc- 
tural Equation Modeling: A Multidisciplinary Journal, 20(1), 47-78. doi: 
https: / /doi.org/10.1080/10705511.2013.742382 


Appendix A Data 


data=read.csv("example_data.csv",header = T) 
colnames (data)=c("ID",paste("math", rep(1:4),sep='") ) 
data=data[,-1] 


Appendix B Using the t-distribution for error 


# The latent basis growth curve model 
modell <- "model { 
# specify the likelihood 
for (i in l:insubj) { 
for (j in l:intime) { 
# t error 
yli; Jl ~ dt(mu[i, j], tauy, df) 
# normal 
# yli, j] ~ dnorm(mu[i, j],tauy) 


for (i in l:nsubj) { 
mu[i,1] <- b[i,1] 
mu[i,2] <- b[i,1]+b[i,2] 
mu[i,3] <- b[i,1]+A3*b[i,2] 


mu[i,4] <- b[i,1]+A4x«b[i,2] 
b[i,1:2] ~ dmnorm(mub[1:2], taub[1:2,1:2]) 


# specify the growth trajectory 

A3~ dnorm(0,1.0E-6) 

A4~dnorm(0,1.0E-6) 

# specify priors 

mub[1]~dnorm(0,1.0E-6) 
mub[2]~dnorm(0,1.0E-6) 

taub[1:2, 1:2] ~ dwish(Omega[1:2, 1:2], 2) 
sigma2b[1:2, 1:2] <- inverse(taub[1:2, 1:2]) 
tauy ~ dgamma(0.001,0.001) 

sigma2y <- 1 / tauy 


df ~ dunif (1,500) 
Omega[1,1] <- 1 

Omega [2,2] <- 1 
Omega[1,2] <- Omega[2,1] 
Omega [2,1] <- 0 


# write model out 
writeLines (modell, "modell.txt") 


# set parameters, adaption, and MCMC chains 
parameters = c("mub","sigma2b","sigma2y","df", 
"A3","A4","dic")# Specify the estimated parameters 
adaptSteps =10000 # Adaptive period 
burniInSteps = 10000 # Burn-in period 
nChains = 2 

nIter =40000 # The number of kept iterations 


nsubj = nrow(data) 
ntime = ncol (data) 


create data set for JAGS model 
datalist = list (nsubj=nsubj,ntime=ntime, y=as.matrix (data) ) 


specify initial values 

inits <- list (list (mub=c(0.7,0.4), 
taub=structure(.Data = c(1,0,0,10),.Dim=c(2,2)), 
tauy=10,df=3), 

list (mub=c(0.7,0.5), 

taub=structure(.Data = c(2,0,0,8),.Dim=c(2,2)), 
tauy=15,df=5) ) 
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# run jags model 
set.seed (1234) 
out <- run. jags (model=model, 
monitor=parameters, 
data=datalist, n.chains=2, 
inits=inits, method="simple", 
adapt=adaptSteps, 
burnin = burnInSteps, 
sample=nIter, 
keep. jags.files=TRUI 
tempdir=TRUE) 


Ba 


# diagnostic 

geweke.diag (out $mcmc) 

# plots 

trace plots and autocorrelation plots 
plot (out) 

Summarize posterior distributions 
mcmcChain = as.matrix(out$mcmc) 

sum = summary (out$mcmc) 


# The linear LGCM 
model2 <- "model{ 
# likelihood 
for (i in 1:nsubj) { 
for (j in 1:ntime) { 
# t error 
yri; j] ~ dt(mu[i, jJ, tauy, df) 


} 
# growth trajectory 
for (i in l:nsubj) { 


mu[i,1] <- b[i,1] 

mu[i,2] <- b[i,1]+b[i,2] 

mu[i,3] <- b[i,1]+A3*b[i,2] 

mu[i,4] <- b[i,1]+A4xb[i,2] 

b[i,1:2] ~ dmnorm(mub[1:2], taub[1:2,1:2]) 


A3 <- 2 # linear change 
A4 <- 3 
mub[1]~dnorm(0,1.0E-6) 
mub[2]~dnorm(0,1.0E-6) 
taub[1:2, 1:2] ~ dwish(Omega[1:2, 1:2], 2) 
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Sigma2b[1:2, 1:2] <- inverse(taub[1:2, 1:2]) 
tauy ~ dgamma(0.001,0.001) 
sigma2y <- 1 / tauy 


df ~ dunif (1,500) 
Omega[1,1] <- 1 

Omega [2,2] <- 1 
Omega[1,2] <- Omega[2,1] 
Omega [2,1] <- 0 


Quadratic LGCM 

model3 <- "model{ 

likelihood 

for (i in l:nsubj) { 

for (j in 1:ntime) { 

# t error 

ylis j] 7 dt(mu[i, J]; tauy, df) 


} 
# growth trajectory 
for (i in l:nsubj) { 


mu[i,1] <- b[i,1] 

mu[i,2] <- b[i,1]+b[i,2]+b[i,3] 

mu[i,3] <- b[i,1]+A3*b[i,2]+A5x«b[i, 3] 
mu[i,4] <- b[i,1]+A4«b[i,2]+A6*b[i, 3] 
b[i,1:3] ~ dmnorm(mub[1:3], taub[1:3,1:3]) 


linear change 


A3 <= 2 
A4 <- 3 
quadratic change 
A5 <- 4 
A6 <- 9 
mub[1]~dnorm(0, —6) 


1] 0.,.1.:01 
mub[2]~dnorm(0,1.0 
mub[3]~dnorm(0,1.0 
taub[1:3,1:3] ~ dwish(Omega[1:3, 1:3], 3) 
sigma2b[1:3, 1:3] <- inverse(taub[1:3,1:3]) 
tauy ~ dgamma(0.001,0.001) 
sigma2y <- 1 / tauy 
df ~ dunif (1,500) 

Omega[1,1] <- 1 
Omega [2,2] <- 1 


| 


Awe 
| 
Doo 

N 
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Omega[3,3] <- 1 
Omega[1,2] <- Omega[2,1] 
Omega[1,3] <- Omega[3,1] 
Omega[2,3] <- Omega[3,2] 
Omega[2,1] <- 0 
Omega[3,1] <- 0 
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Appendix C Using exponential power distribution for 
error 


A linear LGCM 
model4 <- "model{ 
C <- 100000 
lomega <- 0.5*loggam(3+* (1+gamma) /2) -log (1+gamma) 
-—3/2*xloggam((1+gamma) /2) 
cgamma <- (exp (loggam(3* (1+gamma) /2) ) 
/exp (loggam((1+gamma) /2)))* (1/ (1+gamma) ) 
for (i in l:nsubj) { 
for (j in l:intime) { 
# Exponential power 
zeros[i,j] ~ dpois(zeros.mean[i,j]) 
zeros.mean[i,j] <- C-le[i,j] 
le[i,j] <- lomega-log(sqrt (sigma2y) ) 
-cgammaxabs ((y[i, 3]-mu[i,j]) 
/sqrt (sigma2y) ) * (2/ (1+gamma) ) 


} 
# growth trajectory 
for (i in l:nsubj) { 


mu[i,1] <- b[i,1] 

mu[i,2] <- b[i,1]+b[i,2] 

mu[i,3] <- b[i,1]+A3xb[i,2] 

mu[i,4] <- b[i,1]+A4xb[i,2] 

b[i,1:2] ~ dmnorm(mub[1:2], taub[1:2,1:2]) 
} 
A3 <- 2 linear change 
A4 <- 3 
mub[1]~dnorm(0,1.0E-6) 
mub[2]~dnorm(0,1.0E-6) 
taub[1:2, 1:2] ~ dwish(Omega[1:2, 1:2], 2) 
sigma2b[1:2, 1:2] <- inverse(taub[1:2, 1:2]) 
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Appendix D Using the skew normal distribution 


dgamma (0.001,0.001) 


-1,1) 


<- Omega[2,1] 


R. Li 
tauy ~ 
sigma2y <- 1 / tauy 
gamma ~ dunif ( 
Omega[1,1] <- 1 
Omega [2,2] <- 1 
Omega[1,2] 
Omega [2,1] <- 0 


#Al 
model 


5 <- 


inear LGCM 


"model 


Cc <- 100000 


xi <> -sqrt (sigma2y) « 


{ 


xsqrt (2/3.1415) 


for 
for 


(i in 1l:nsubj) 
(j in l:ntime) { 


{ 


# Exponential power 
dpois (zeros.mean[i 


zeros[i 


rj] 


zeros.mean[i 


efi, jl 
phi ( 


le[i 


} 


rj] 


<- yli 


rj 


T 


] <- C-le[i 


jl-mu[i,j] 


rj] 


): standard normal cdf 


<- log(2) 
+logdensity.norm((e[i 
t+log (phi (alphax(e[i 


rj] 
rj] 


# growth trajectory 


for (i in l:nsubj) { 
mu[i,1] <- b[i,1] 
mu[i,2] <- b[i,1]+b[i,2] 
mu[i,3] <- b[i,1]+A3«b[i,2] 
mu[i,4] <- b[i,1]+A4«b[i,2] 
bane 2) = amie E 14 
} 
A3 <- 2 linear change 
A4 <- 3 
mub[1]~dnorm(0,1.0E-6) 
mub[2]~ ae hes OE-6) 
taub[1:2, 2:2] ~ dwish(Omega[1:2, 1: 
ee 1:2] <- inverse (taub[1 


tau: [lis 1s 2] ) 


rJj]) 


# the log density of x is given by 
-log (sqrt (sigma2y) ) 
-xi)/sqrt(sigma2y), 
-xi)/sqrt (sigma2y))) 


(alpha/sqrt (1+alpha^2)) 


0, 


1) 
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tauy ~ dgamma(0.001,0.001) 
sigma2y <- 1 / tauy 
alpha ~” dunif(-5,5) 


Omega[1,1] <- 1 
Omega [2,2] <- 1 
Omega[1,2] <- Omega[2,1] 
Omega [2,1] <- 0 
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Abstract. Meta-analysis of proportions has been widely adopted across 
various scientific disciplines as a means to estimate the prevalence of phe- 
nomena of interest. However, there is a lack of comprehensive tutorials 
demonstrating the proper execution of such analyses using the R pro- 
gramming language. The objective of this study is to bridge this gap 
and provide an extensive guide to conducting a meta-analysis of pro- 
portions using R. Furthermore, we offer a thorough critical review of 
the methods and tests involved in conducting a meta-analysis of pro- 
portions, highlighting several common practices that may yield biased 
estimations and misleading inferences. We illustrate the meta-analytic 
process in five stages: (1) preparation of the R environment; (2) compu- 
tation of effect sizes; (3) quantification of heterogeneity; (4) visualization 
of heterogeneity with the forest plot and the Baujat plot; and (5) expla- 
nation of heterogeneity with moderator analyses. In the last section of 
the tutorial, we address the misconception of assessing publication bias 
in the context of meta-analysis of proportions. The provided code offers 
readers three options to transform proportional data (e.g., the double 
arcsine method). The tutorial presentation is conceptually oriented and 
formula usage is minimal. We will use a published meta-analysis of pro- 
portions as an example to illustrate the implementation of the R code 
and the interpretation of the results. 


Keywords: Meta-analysis of proportions - Heterogeneity - Meta-regression 
- Double arcsine transformation - Baujat plot 


1 Introduction 


A meta-analysis is a statistical approach that synthesizes quantitative findings 
from multiple studies investigating the same research topic. Its purpose is to 
provide a numerical summary of a particular research area, aiming to inform 
future work in that area. Meta-analyses of proportions are commonly conducted 
across diverse scientific fields, such as medicine (e.g., Gillen, Schuster, Meyer 
Zum Bschenfelde, Friess, & Kleeff, 2010), clinical psychology (e.g., Fusar-Poli et 
al., 2015), epidemiology (e.g., Wu, Long, Lin, & Liu, 2016), and public health 
(e.g., Keithlin, Sargeant, Thomas, & Fazil, 2014), etc. The outcomes derived 


Meta-analyses of Proportions in R 65 


from these studies are often used for decision models (Hunter et al., 2014). 
Each individual study included in a meta-analysis of proportions contributes a 
specific number of “successes” and a corresponding total sample size (Hamza, van 
Houwelingen, & Stijnen, 2008). While the majority of meta-analyses primarily 
focus on effect-size metrics that measure a relationship between a treatment 
group and a control group—such as standardized mean difference and odds ratio— 
the effect-size metric in meta-analyses of proportions is an estimate of the overall 
proportion related to a particular condition or event across all included studies 
(Barendregt, Doi, Lee, Norman, & Vos, 2013). For instance, a meta-analysis 
can be conducted to provide an overall prevalence estimate of homeless veterans 
affected by both post-traumatic stress disorder and substance use disorder. 


The purpose of this tutorial is to provide an introduction to conducting a 
meta-analysis of proportions using the R software (R Core Team, 2022). We dis- 
cuss two distinct benefits of choosing R as your primary meta-analysis tool. First, 
R is freely available open-source software that offers a comprehensive collection 
of R packages, which are extensions developed for specialized applications, in- 
cluding meta-analysis. This remarkable feature provides researchers with diverse 
possibilities and flexibility when it comes to data manipulation and analysis. Two 
widely used R packages for meta-analysis are metafor (Viechtbauer, 2010) and 
meta (Schwarzer, Carpenter, & Riicker, 2015). Second, R offers more convenient 
options for transforming proportional data than other statistical software. The 
two commonly adopted data transformation methods are the logit and the dou- 
ble arcsine transformations (though not transforming data is also appropriate 
under certain circumstances). Both the metafor and meta packages are capable 
of performing these transformations. In contrast, other meta-analysis software 
such as Comprehensive Meta-Analysis (CMA) (Borenstein, Hedges, Higgins, & 
Rothstein, 2005) and MedCalc (Schoonjans, 2017) can only perform one of these 
transformations. Additionally, while CMA and MedCalc automatically trans- 
form data, R allows meta-analysts to make a decision on whether to apply data 
transformation. 


To the best of our knowledge, this is the first tutorial that illustrates the 
implementation of such analyses. The tutorial offers an overview of the funda- 
mental statistical concepts related to meta-analysis of proportions and provides 
hands-on code examples to guide readers through the process in R.! We use a 
dataset from a published meta-analytic study to detail the steps involved. More- 
over, we’ve rigorously tested the code in R and validated it using CMA, ensuring 
identical results from both software. 


Last but not least, this tutorial will explain why common publication bias 
assessment procedures aren’t recommended for meta-analyses of proportions. 


1 Throughout this tutorial, we'll present generic code templates for all transformation 
methods. However, the main text of this tutorial will focus on code examples for the 
logit transformation, given the similarity in coding across all methods. For R code 
related to other transformation methods and their associated datasets, please refer 
to the supplementary files. 
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2 Preparation of the R environment 


2.1 R and RStudio 


The first step is to download R. The base R program can be downloaded for 
free from the Comprehensive R Archive Network (https://cran.r-project.org/). 
R provides a basic graphical user interface (GUI), but we recommend that 
readers use a more productive code editor that interfaces with R, known as 
RStudio (RStudio Team, 2022). This is a development environment built to 
make using R as effective and efficient as possible, which is freely available at 
https: //www.rstudio.com/. It adds much more functionality above and beyond 
R’s bare-bones GUI. 

Once RStudio is successfully installed on your computer and opened, the first 
step is to create a new R Script. To do this, navigate to the “File” menu. Click on 
“File”, and in the dropdown menu, select “New File”, then choose “R Script”. 
A new tab will open in the top-left pane of RStudio, known as the source editor. 
This space is where you’ll write your R code. 


2.2 Setting up the working directory 


To ensure proper organization of your R files and data, it’s crucial to establish 
a working directory for the current R session. A working directory serves as a 
centralized location where you can store all your work, including the R code 
you’ve written and data files (e.g., .csv files) you wish to import into R for 
analysis. To set up a working directory, start by creating a folder named “data” 
in your preferred location on the computer, such as the D drive. After doing so, 
enter the following code into the source editor: 


setwd("D:/data") 


3 Overview of the example data set 


3.1 Illustrative example: Prevalence and epidemiological 
characteristics of congenital cataract (Wu et al., 2016) 


The data set we will use for this tutorial is extracted from a published meta- 
analytic study conducted by Wu et al. (2016). They estimated the prevalence of 
congenital cataracts (CC) and their main epidemiological traits. CC refers to the 
opacity of the lens detected at birth or at an early stage of childhood. It is the 
primary cause of treatable childhood blindness worldwide. Current studies have 
not determined the etiology of this condition. The few large-scale epidemiological 
studies on CC also have limitations: they involve specific regions, limited popu- 
lations, and partial epidemiological variables. Wu et al. (2016) aimed to explore 
its etiology and estimate its population-based prevalence and major epidemio- 
logical characteristics, morphology, associated comorbidities and etiology. The 
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original dataset consists of 27 published studies that were published from 1983 
to 2014, among which 17 contained data on the population-based prevalence of 
CC, 2 were hospital-based studies and 8 were CC-based case reviews. Samples 
investigated in the studies were from different regions of the world, including 
Europe, Asia, the USA, Africa, and Australia. The sample sizes of the included 
studies ranged from 76 to 2,616,439 patients, with a combined total of 8,302,708 
patients. The diagnosed age ranged from 0 to 18 years of age. The proportions 
were transformed using the logit transformation, which is commonly employed 
when dealing with proportional data. This transformation results in a sampling 
distribution that is more normal, with a mean of zero and a standard deviation 
of 1.83. The authors coded five moderators, including world region (China vs. 
the rest of the world), study design (birth cohort vs. other), sample size (less 
vs. more than 100,000), diagnosed age (older vs. younger than 1 year old), and 
research period (before vs. after the year 2000). All of these potential moderators 
are categorical variables. Due to page limits, we will work with only a subset of 
the provided moderating variables, including study design and sample size. 


3.2 Recommended format for organizing data 


Prior to performing a meta-analysis in R, it is important to first organize the 
data properly. Table 1 shows an excerpt of the example dataset. Each row in this 
table represents the data extracted from a primary study included in the current 
meta-analysis. The columns contain variables that will be used to compute effect 
sizes, create plots, and conduct further analyses. 


Table 1. Data from Wu et al. (2016) 


author year authoryear cases total studesg studydesign size samplesize 
Stewart-Brown 1988 Stewart-Brown 1988 7 12853 0 Birth cohort 0 < 100000 
Bermejo 998 Bermejo 1998 71 1124654 0 Birth cohort > 100000 
SanGiovanni 2002 SanGiovanni 2002 73 53639 0 Birth cohort 0 < 100000 
Haargaard 2004 Haargaard 2004 773 2616439 0 Birth cohort > 100000 
Stayte 993 Stayte 1993 4 6687 0 Birth cohort 0 < 100000 
Stoll 997 Stoll 1997 57 212479 0 Birth cohort > 100000 
Rahi 2001 Rahi 2001 248 734000 Others > 100000 
Wirth 2002 Wirth 2002 421 1870000 Others > 100000 
Hu 987 Hu 1987 77 207319 Others > 100000 
Abrahamsson 1999 Abrahamsson 1999 136 377334 Others > 100000 
Bhatti 2003 Bhatti 2003 199 982128 Others > 100000 
Nie 2008 Nie 2008 15 15398 Others 0 < 100000 
Chen 2014 Chen 2014 6 9246 Others 0 < 100000 
Yang 2014 Yang 2014 8 6299 Others 0 < 100000 
Pi 2012 Pi 2012 3 3079 Others 0 < 100000 
Holmes 2003 Holmes 2003 10 33021 Others 0 < 100000 
Halilbasic 2014 Halilbasic 2014 51 38133 Others 0 < 100000 


In this data set, we have separate columns for authors’ names and the year 
of publication, which will be useful when sorting studies according to the year 
of publication in R. Additionally, if we decide to use the forest() function in the 
meta package to create forest plots, we need to create a column that combines 
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both variables. In this case, we label the column as “authoryear”. It’s important 
to note that when importing a data file into R, column names with uppercase 
letters will be converted to lowercase. Therefore, we cannot use uppercase or 
lowercase letters to differentiate between different columns. Moreover, we cannot 
leave a blank space between two words when naming a column. As seen in the 
table, we use “authoryear” instead of “author year”, “studydesign” instead of 
“study design”, and “samplesize” instead of “sample size”. 

The variable “cases” represents the number of the event of interest in the 
sample of each study. By dividing “cases” by “total”, we can obtain the propor- 
tions needed to compute effect sizes, which are labeled as “yi” in R. R will also 
calculate the sampling variance for each “yi” and label them as “vi”. The remain- 
ing variables in the dataset are potential moderators, which will be examined in 
either a subgroup analysis or a meta-regression. For instance, “study design” is 
a potential moderator with two categories or levels: “birth cohort” and “others”. 
We have coded each category as either 1 or 0 in the column labeled “studesg” . 
For continuous moderators, readers can create columns to store continuous val- 
ues, such as the “year” column. This dataset is saved as a comma-separated 
values (.csv) file named “data.csv” and is included in the online supplemental 
materials for this tutorial. To import it into R, ensure the .csv file is stored in 
the working directory. 


4 Computation of effect sizes 


4.1 Fixed-effect and random-effects model 


Before combining effect sizes in a meta-analysis, we need to make a choice be- 
tween two modeling approaches for calculating the summary effect size:? the 
fixed-effect and random-effects model (Hedges & Vevea, 1998; Hunter & Schmidt, 
2000). The fixed-effect model assumes that studies included in a meta-analysis 
are functionally equivalent, sharing a common true effect size. Put differently, 
the true effect size is identical across studies, and any observed variation in ef- 
fect size estimates is solely due to random sampling error within each study, 
known as within-study variance. The random-effects model allows the included 
studies to have true effect sizes that are not identical or “fixed” but follow a 
normal distribution. In other words, the random-effects model accounts for both 
within-study and between-study variances, while the fixed-effect model assumes 
that the between-study variance is zero (i.e., between-study heterogeneity does 
not exist). 

The fixed-effect model applies when participants in the studies are drawn 
from a single common population and undergo the same experimental procedures 
conducted by the same researchers under identical conditions. For instance, a 
series of studies with the same protocol conducted in the same lab and sampling 
from the same population (e.g., school children from the same class) may fit 
the fixed-effect model. However, these conditions rarely hold in reality. In fact, 


2 The “summary effect size” and “overall effect size” are interchangeable terms. 
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the majority of meta-analyses are conducted based on studies collected from the 
literature. In such cases, we can generally assume that the true effect varies from 
study to study. Even when a group of studies focuses on a common topic, they 
are often conducted using different methods (Borenstein, 2019). Consequently, 
the true effect size is assumed to follow a normal distribution under the random- 
effects model. 

An additional limitation of the fixed-effect model is that its conclusions are 
limited to the specific set of studies included in the meta-analysis and cannot 
be generalized to multiple populations. However, most social scientists aim to 
make inferences that extend beyond the selected set of studies in their meta- 
analyses. As a general rule of thumb, the random-effects model will be more 
plausible than the fixed-effect model in most meta-analytic studies because the 
random-effects model allows more generalizable conclusions beyond a specific 
population (Borenstein, 2019; Borenstein, Hedges, Higgins, & Rothstein, 2009). 
However, we discourage the practice of switching to the random-effects model 
from the fixed-effect model based solely on the results of heterogeneity tests. We 
will discuss the reasons in more depth later. 

The random-effects model can be estimated by several methods (although 
other methods exist, we will focus on the most popular ones here): the method 
of moments or the DerSimonian and Laird method (DL; DerSimonian & Laird, 
1986) and the restricted maximum likelihood method (REML; Raudenbush & 
Bryk, 1985). In all cases, the summary effect size (i.e., the summary proportion) 
is estimated as the weighted average of the observed effect sizes extracted from 
primary studies. The weighting for each observed effect size is the inverse of the 
total variance of a study, which is the sum of the within-study variance and 
the between-study variance (Ma, Chu, & Mazumdar, 2016). These two methods 
differ mainly in the estimation of the between-study variance, commonly denoted 
as 7°? in the meta-analytic literature. The technical differences between these 
methods have been summarized elsewhere (e.g., Knapp, Biggerstaff, & Hartung, 
2006; Thorlund, Wetterslev, Awad, Thabane, & Gluud, 2011; Veroniki et al., 
2016) and will not be discussed here. 


4.2 Transformation of proportions: the logit transformation and the 
double arcsine transformation 


When the observed proportions are around 0.5 and the number of studies is 
sufficiently large, the proportions follow an approximately symmetrical bino- 
mial distribution. Under such circumstances, the normal distribution is a good 
approximation of the binomial distribution, and using the raw proportion as 
the effect-size metric for analysis is appropriate (Barendregt et al., 2013; Box, 
Hunter, & Hunter, 2005; Wang & Liu, 2016). Additionally, based on their simula- 
tion study, Lipsey and Wilson (2001) suggested that when observed proportions 
derived from primary studies fall between 0.2 and 0.8, and the focus is solely on 
the mean proportion across the studies, the raw proportion can be adequately 
employed as the effect-size metric. The procedure for calculating the effect size, 
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sampling variance, and inverse variance weight for an individual study using the 
raw proportion is as follows (Lipsey & Wilson, 2001): 
The raw proportion is given by: 


k 
ES, =p == (1) 
with its sampling variance: 
ep PULP) 
Var, = SE, = = (2) 
and the inverse variance weight: 
1 1 n 
Wp (3) 


Var, SE? p(1-p) 


where p is the proportion, k is the number of individuals or cases in the category 
of interest, and n is the sample size. ES, SE, Var, and w stand for effect size, 
standard error, sampling variance, and inverse variance weight, respectively. 

However, when collecting studies for a meta-analysis of proportions, it is 
observed that proportional data are rarely centered around 0.5 and often ex- 
hibit significant skewness (Hunter et al., 2014). As the proportions deviate fur- 
ther from 0.5 and approach closer to the boundaries (particularly when they 
are below 0.2 or above 0.8), they become less likely to be normally distributed 
(Lipsey & Wilson, 2001). Additionally, using the raw proportion as the effect- 
size metric in such situations may underestimate the coverage of the confidence 
interval around the weighted average proportion and overestimate the level of 
heterogeneity among the observed proportions (Lipsey & Wilson, 2001). Conse- 
quently, relying on the assumption of normality may lead to biased estimation 
and potentially misleading or invalid inferences (Feng et al., 2014; Ma et al., 
2016). 

To address the skewness in the distribution of observed proportions, it is 
common practice to apply transformations to the observed proportions collected 
for a meta-analysis. This is done to ensure that the transformed proportions con- 
form as closely as possible to a normal distribution, thus enhancing the validity 
of subsequent statistical analyses (Barendregt et al., 2013). More specifically, all 
computations and analyses are performed based on the transformed proportions 
(e.g., the natural logarithm of the proportion) and their inverted variances (i.e., 
the study weight). The results, such as the summary proportion and its confi- 
dence interval, are presented in the original effect-size metric (i.e., proportion) 
for ease of presentation and interpretation (Borenstein et al., 2009). 

In practice, the approximate likelihood approach (Agresti & Coull, 1998) is 
arguably the predominant framework for modeling proportional data (Hamza et 
al., 2008; Nyaga, Arbyn, & Aerts, 2014). There are two main ways to transform 
observed proportions within this framework: the logit or log odds transforma- 
tion (Sahai & Ageel, 2012) and the Freeman-Tukey double arcsine transforma- 
tion (Freeman & Tukey, 1950; Miller, 1978). For the logit transformation, the 
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observed proportions are first converted to their natural logarithm of the pro- 
portions (i.e., the logit). Following the transformation, the logit transformed 
proportions are assumed to follow a normal distribution, and all analyses are 
conducted on the logit scale. Subsequently, the logits are converted back into 
proportions for reporting and interpretation purposes. The procedure for cal- 
culating the logit, its standard error and inverse variance weight for primary 
studies, as well as the formula for back-transformation, are as follows (Lipsey & 
Wilson, 2001). 
The logit is calculated by: 


ES) = log, (4) =In (4) (4) 


with its sampling variance: 


Varı = SE? = — + —___- 5 

‘np n(1—p) 6) 
and the inverse variance weight: 
1 

Me SE (6) 


To convert the transformed values into proportions, use: 


elogit 
Pp 


= Goat fT g 

Being widely employed in meta-analyses of proportions, the logit transforma- 
tion still has its limitations in certain situations. Two limitations are particularly 
noteworthy. 

First, the issue of variance instability persists even after applying the logit 
transformation (Barendregt et al., 2013; Hamza et al., 2008). The purpose of 
data transformation is to bring the skewed data closer to a normal distribution 
or at least to achieve more consistent variance. While the logit transformation 
generates a sampling distribution that approximates normality to a greater ex- 
tent, it fails to stabilize the variance, potentially placing undue weight on studies. 
According to the equation for sampling variance (Eq. 5), for a fixed value of n, 
the variance changes with p. For instance, consider a situation with two studies of 
the same sample size, where an observed proportion close to 0 or 1 yields grossly 
magnified variance, while an observed proportion around 0.5 yields squeezed 
variance, leading to variance instability (Barendregt et al., 2013). 

Second, when the event of interest is extremely rare (i.e. p = 0) or extremely 
common (i.e., p = 1), the logits and their sampling variances become undefined. 
In practice, the common solution is to add an arbitrary constant 0.5 correction 
to the np and n(1-p) for all studies (Hamza et al., 2008). However, this approach 
has been shown to introduce additional bias to the results (Lin & Xu, 2020; Ma 
et al., 2016). 
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Both of the aforementioned problems can be elegantly solved by employing 
the variance-stabilizing transformation known as the double arcsine transfor- 
mation (Freeman & Tukey, 1950), which is accomplished with the following 


equation?: 
| k k+1 
ES = sin”! mil + sin! aad (8) 


The sampling variance is computed by: 


(9) 


Var; = ——— 
we n+ 0.5 


The back-transformation is computed by the equation as proposed by Miller 
(1978): 


1 
: 273 
1 sint — 1. 
p= 1 — sgn (cost) i (sine | ai sat) | (10) 
where ¢ denotes the double arcsine transformed value or the confidence interval 
around it with sgn being the sign operator. In Eq. (10), the total sample size 
denoted by n’ is calculated as the harmonic mean of individual sample sizes 
(Miller, 1978). The harmonic mean is defined as: 


emd a (11) 


where n; denotes the sample size of each included study and m denotes the 
number of included studies. Miller (1978) gives an example in his paper: a meta- 
analysis of proportions includes four studies with sample sizes being 11, 17, 21, 
and 6, respectively. The harmonic mean of the four sample sizes will be: 

4 


nl = = 10.9885. (12) 
utitats 


Barendregt et al. (2013) found that Eq. (10) becomes numerically unstable 
when sint is close to 0 or 1, leading to potentially misleading results. This 
phenomenon has also been documented by recent publications (Evangelou & 
Veroniki, 2022; Lin & Xu, 2020; Schwarzer, Chemaitelly, Abu-Raddad, & Riicker, 
2019). Instead of the harmonic mean, Barendregt et al. (2013) and Xu et al. 
(2021) recommend using 1/0 as the estimate for the total sample size. They 
propose that the double arcsine back-transformation be implemented as follows: 


il se sint- => 2]? 
pa 1 — sgn(cost) |1— | sint 4 (13) 


aie 


3 The metafor package uses different definitions of Eq.8 and 9. For more details, see 
https: //www.metafor-project.org/doku. php/faq. 
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where p is the pooled proportion on the natural scale and 0 is the pooled variance 
on the transformed scale. Notice that Eq. (13) uses 1/0 instead of the harmonic 
mean. 

In summary, raw proportions are adequate when the observed proportions 
from primary studies fall between 0.2 and 0.8. When observed proportions are 
less than 0.2 or greater than 0.8, the logit or double arcsine transformation 
is recommended. It is worth noting that some simulation studies have shown 
that the double arcsine method slightly outperforms the logit transformation in 
terms of relative bias, mean squared error, and 95% coverage (Barendregt et 
al., 2013; Xu et al., 2021). Furthermore, the double arcsine method would be a 
more appropriate choice when extreme proportions need to be addressed. Last 
but not least, we recommend Eq. (13) when applying the back-transformation 
of the double arcsine method. 


4.3 Calculating the summary effect size in R 


In a meta-analysis, effect sizes are weighted by the inverse of their sampling 
variances, giving greater weight to larger studies and allowing their effect sizes 
to have a greater impact on the overall mean. The weighted average proportion 
(i.e., the summary proportion) can be computed as follows (Barendregt et al., 
2013): 


Pi 


> (wipi) = 3 Varp; 
3 Wi = Varn 


SEp => wi=4/>_ ae (15) 


The confidence interval of the weighted average proportion can be expressed 
as follows: 


ESp=P= (14) 


with its sampling error: 


Py = P — Zaa) (SEp) 
Py = P + Za-a) (SEP) 


where Z(a-a) = 1.96 when a = 0.05. 

We will now proceed with the first step of our meta-analysis. First, readers 
need to install and download the necessary R packages. These packages are devel- 
oped to run within R and contain a collection of functions that are essential for 
conducting meta-analyses. In this tutorial, we will install two packages: metafor 
(Viechtbauer, 2010) and meta (Schwarzer et al., 2015). We will primarily rely on 
metafor and use meta to create forest plots. To install these packages, execute 
the following command: 


(16) 


install.packages(c("metafor", "meta")) 


Once readers have installed a package, it becomes permanently available for 
use in R on this specific computer. To use the installed packages, one needs to 
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execute the library() function each time you run R. To load metafor and meta 
into the current R session, type the following R code: 


library (metafor) 
library (meta) 


We then need to import data.csv into R and create a data frame named 
“dat”. This can be achieved by using the read.csv() function and running the 
following code: 


dat <- read.csv("data.csv", header = TRUE, sep = ",") 


The code above represents a standard approach to importing .csv files. It in- 
structs R to read a .csv file, interpreting the first row as column names, and 
recognizing commas as the separators between values. 

To estimate the weighted average proportion, we will use the following func- 
tions in metafor: escalc(), rma(), and predict(). These functions, in conjunction 
with a range of arguments to be specified within them, provide instructions to 
R on how to calculate effect sizes. Note that certain arguments have default 
values, such as weighted = TRUE, so users don’t need to specify them. The es- 
calc() function estimates an effect size and its standard error for every primary 
study included in a meta-analysis. Users have the flexibility to decide whether 
to transform these effect sizes and, if so, which transformation method to em- 
ploy, by using the measure argument. We will now create a data frame named 
“ies” (short for individual effect size) to store calculated effect sizes and standard 
errors using the following generic code: 


#Only choose one of the three transformation methods 
ies <- escalc(xi = cases, ni = total, data = dat, 
measure = "PR") 


Here, the variable “cases” contains the number of events. The variable “total” 
contains the sample size. We use the argument data to inform R that these 
variables are contained in the data frame “dat”. By using the argument measure, 
we can specify which computational method to employ for transforming the raw 
proportions: 


measure = "PR" #No transformation 
measure = "PLO" #The logit transformation 
measure = "PFT" #The double arcsine transformation 


We will then use the function rma() to pool the derived effect sizes. The 
function will yield a summary proportion, its standard error, and a 95% con- 
fidence interval. Additionally, it will also conduct heterogeneity tests. We can 
execute the following code to achieve this: 


pes <- rma(yi, vi, data = ies, method = "REML") 
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Although naming an object in R is arbitrary, we strongly recommend that 
readers assign meaningful names to objects. In this case, if we decide not to 
perform a transformation, we will name this object “pes”, which stands for 
pooled effect size. If we decide to perform a transformation with either the logit 
or the double arcsine, we will name it “pes.logit” or “pes.da”, which stands for 
logit or double-arcsin transformed pooled effect size, respectively. The object will 
store all of the outcomes. The method argument dictates which of the following 
between-study variance estimators will be used (the default method is REML): 


method = "DL" #The DL estimator 
method = "REML" #The REML estimator 


If unspecified, rma() estimates the variance component using the REML esti- 
mator. Even though rma() stands for random-effects meta-analysis, the function 
can perform a fixed-effect meta-analysis with the code: 


method = "FE" 


The object “pes.logit” or “pes.da” now contains the estimated transformed 
summary proportion. To convert it back to its original, non-transformed scale 
(i.e., proportion) and yield an estimate for the true summary proportion, we can 
use the predict() function: 


#Inverse of logit transformation 

pes <- predict(pes.logit, transf = transf.ilogit) 

#Inverse of double arcsine transformation 

pes <- predict(pes.da, transf = transf.ipft.hm, targ = 
list(ni = dat$total)) 


The argument transf dictates how to convert the transformed proportion 
back to proportion. As mentioned earlier, we can follow two methods for back- 
transformation (Eq. 10 or Eq. 13). In either case, we set the transf argument 
to transf.ipft.hm (the “hm” stands for the harmonic mean). If we opt for the 
harmonic mean (n) in Eq. (10) as the estimate for the total sample size, the 
sample sizes of primary studies are specified by setting the targ argument to 
list(ni = dat§$total). If we opt to use 1/0 as the total sample size estimate, then 
we specify the total sample size as 1/(pes.da$se)? within the targ argument and 
use the following code for back-transformation: 


pes <- predict(pes.da, transf = transf.ipft.hm, targ = 
list (ni=1/(pes.da$se) 2) ) 


Finally, to see the output for the estimated summary proportion and its 95% 
CI, we can use the print() function: 


print (pes) 


For the sake of readers’ convenience, we provide readers with generic code 
for calculating the summary proportion under the random-effects model using 
three different transformation methods: 
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# Option 1: no transformation 


ies <- escalc(xi = cases, ni = total, data = dat, 
measure = "PR") 
pes <- rma(yi, vi, data = ies) 


print (pes) 


# Option 2: the logit transformation 


ies.logit <- escalc(xi = cases, ni = total, data = 
dat, measure = "PLO") 
pes.logit <- rma(yi, vi, data = ies.logit) 


pes <- predict(pes.logit, transf = transf.ilogit) 
print (pes) 


+ 


Option 3: the double arcsine transformation 

# targ can also be set to list(ni = 1/(pes.da$se)^2) 

ies.da <- escalc(xi = cases, ni = total, data = 
dat, measure = "PFT", add = 0) 

pes.da <- rma(yi, vi, data = ies.da) 

pes <- predict(pes.da, transf = transf.ipft.hm, 
targ = list(ni = dat$total)) 

print (pes) 


Note the use of add = 0 in Option 3. When a study contains proportions 
equal to 0, the escalc() function will automatically add 0.5 to the observed data 
(i.e., the “cases” variable). Since the double arcsine transformation does not 
require any adjustments to be made to the data in such a situation, we can 
explicitly switch add = 0.5 to add = 0 to prevent the default adjustment. 

Returning to the running example, we chose Option 2 (i.e., the logit trans- 
formation) to calculate the summary proportion because all of the observed 
proportions in the dataset are far below 0.2: 


ies.logit <- escalc(xi = cases, ni = total, measure = 
"PLO", data = dat) 
pes.logit <- rma(yi, vi, data = ies.logit, method = 


"DL", level = 95) 
pes <- predict(pes.logit, transf = transf.ilogit) 
print(pes, digits = 6) 


The argument digits specifies the number of decimal places to which the 
printed results should be rounded, with the default value being 4. The argument 
level specifies the confidence interval, with the default value set to 95%.* 


4 In this particular case, the estimates of T, T°, and I? will fall outside of the 95% 
CI for unknown reasons (though the summary proportion will not). The original 
authors did not discover this issue. One way to address this issue is by switching to 
the 99% CI. However, for the sake of consistency, we will continue to use the 95% 
CI throughout this tutorial. 
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The estimated summary proportion and its 95% CI are shown in Figure 1. 
Interpreting these summary statistics, we find that the summary proportion is 
estimated to be 0.000424 and its 95% CI is between 0.000316 and 0.000569. 


pred ci.lb ci.ub cr.1b cr.ub 
0.000424 0.000316 0.000569 0.000133 0.001347 


Figure 1. Summary proportion and its 95% CI 


5 Quantification of heterogeneity 


Meta-analysis aims to synthesize studies and estimate a more precise summary 
effect. An important decision that all meta-analysts face is whether it is appropri- 
ate to combine a set of identified studies in a meta-analysis, given the inevitable 
differences in their characteristics to varying degrees. Combining studies with 
substantially different effect estimates can result in an inaccurate summary effect 
and an unwarranted conclusion. For example, in a meta-analysis of proportions 
regarding re-offending rates among juvenile offenders in a city, the summary 
proportion may fall within a medium range (around 0.5). However, considerable 
variation exists among these proportions, with some studies conducted in certain 
boroughs reporting small proportions (e.g., under 0.1), while others report very 
large proportions (e.g., above 0.9). Simply reporting a moderately large mean 
proportion would be misleading, as it fails to acknowledge the significant vari- 
ation or inconsistency in effect sizes across the studies. This variation is known 
as heterogeneity (Del Re, 2015). We will introduce three quantifying statistics 
for heterogeneity in this section: T?, Q, and I?. 


5.1 The between-study variance: 7? 
Heterogeneity can be quantified by dividing it into two distinct components: 
the between-study variance, which arises from the true variation among a body 
of studies, and the within-study variance, resulting from the sampling error. 
The true variation can be attributed to clinical and/or methodological diversity, 
in other words, the systematic differences between studies beyond what would 
be expected by chance, such as experimental designs, measurements, sample 
characteristics, interventions, study settings, and combinations thereof (Lijmer, 
Bossuyt, & Heisterkamp, 2002; Thompson & Higgins, 2002). In this tutorial, we 
focus on the true variation in effect sizes, namely the between-study heterogene- 
ity. 

We characterize between-study heterogeneity by the variance of the true 
effect size underlying the data, 7”, a statistic called tau-squared. Under the 
assumption of normality, 95% of the true effects are expected to fall within + 
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1.96 x T of the point estimate of the summary effect size (Borenstein, Hedges, 
Higgins, & Rothstein, 2010). 7° reflects the total amount of systematic differences 
in effects across studies. The total variance of a study consists of the between- 
and within-study heterogeneity and is used to assign weights under the random- 
effects model (i.e., the inverse of the total variance). 

In classic inverse variance meta-analysis, T? can be estimated by numerous 
methods, as mentioned in Section 4 (e.g., REML, DL). Review and simulation 
studies have shown that both methods perform satisfactorily well across various 
situations; the differences between their results are negligible and rarely signif- 
icant enough to impact the qualitative conclusions (e.g., Hamza et al., 2008; 
Thorlund et al., 2011; Veroniki et al., 2016). Nevertheless, it is advisable to ob- 
tain the 95% confidence interval around the point estimate of r?°, especially when 
the number of included studies is small (less than 5) (Veroniki et al., 2016). 

In practice, the DerSimonian and Laird estimator is arguably the most com- 
monly used statistical method for meta-analyses of proportions and has become 
the conventional and default method for assessing the amount of between-study 
heterogeneity in many software packages, such as CMA (Cornell et al., 2014; 
Schwarzer et al., 2015). All estimations in this tutorial are based on the DL 
method. 


5.2 Test of heterogeneity: Cochran’s Q 


Using formal tests, the presence of between-study heterogeneity is generally ex- 
amined using a x? test with a statistic Q (Cochran, 1954) under the null hy- 
pothesis that all studies share the same true effect (Hedges & Olkin, 1985). In 
other words, the Q-test and its p-value serve as a test of significance to address 
the null hypothesis: Hp : T? = 0. If the value of the Q-statistic is above the 
critical x? value, we will reject the null hypothesis and conclude that the effect 
sizes are heterogeneous. Under such circumstances, you may consider taking the 
random-effects model route. If Q does not exceed this value, then we fail to 
reject the null hypothesis. 

It is important to exercise caution when interpreting a non-significant p-value 
and drawing the conclusion of homogeneous true effects. The statistical power of 
the Q-test heavily relies on the number of studies included in a meta-analysis, 
and as a result, it may fail to detect heterogeneity due to limited power when the 
number of included studies is small (less than 10) or when the included stud- 
ies are of small size (Huedo-Medina, Snchez-Meca, Marn-Martnez, & Botella, 
2006). Therefore, a non-significant result should not be taken as showing empir- 
ical evidence for homogeneity (Hardy & Thompson, 1998). This issue warrants 
serious attention, considering that a significant proportion of meta-analyses in 
Cochrane reviews involve only five or fewer studies (Davey, Turner, Clarke, & 
Higgins, 2011). 

Furthermore, it is important to note that the Q-test, in addition to its afore- 
mentioned limitation, only assesses the viability of the null hypothesis and does 
not provide a quantification of the magnitude of the true heterogeneity in effect 
sizes (Card, 2015). 
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5.3 I? statistic 


Higgins, Thompson, Deeks, and Altman (2003) proposed a statistic for mea- 
suring heterogeneity, denoted as I?, that remains unaffected by the number of 
included studies. In essence, it reflects the ratio of the observed heterogeneity, 
representing the true between-study variance, to the total observed heterogeneity 
(i.e., the sum of between- and within-study variance). As a result, it facilitates 
the comparison of heterogeneity estimates across meta-analyses, regardless of 
the original scale used in the meta-analyses themselves. 

I? can take values from 0% to 100%. A value of 0% indicates that all het- 
erogeneity is caused by sampling error alone, requiring no further explanation. 
Conversely, when I? equals 100%, the entire heterogeneity can be attributed ex- 
clusively to genuine differences between studies, thus justifying the application 
of subgroup analyses or meta-regressions to identify potential moderating fac- 
tors. The thresholds of 25%, 50%, and 75% are commonly used to indicate low, 
medium, and high heterogeneity, respectively (Higgins et al., 2003) . Note that 
these thresholds only serve as tentative benchmarks for I?. The 95% CI around 
the I? statistic should also be calculated (Cuijpers, 2016; Ioannidis, Patsopoulos, 
& Evangelou, 2007). 

Relying solely on the value of J? can be misleading because a 0% I?, ac- 
companied by a 95% CI ranging from 0% to 80%, does not necessarily indicate 
homogeneity in a small meta-analysis study. Rather, the degree of heterogeneity 
remains uncertain in such cases. 


An important caveat 

Together, the Q-statistic, r?, and J? can inform us if the effects are 
homogeneous, or consistent. When the effect sizes are reasonably consistent, 
it is appropriate to combine them and present a summary effect size in 
reports. In cases where moderate and substantial heterogeneity is present, 
the summary effect size becomes less informative or even of no value. In such 
cases, we strongly suggest that researchers conduct moderator analyses to 
thoroughly explore the possible sources of heterogeneity in observed effect 
sizes rather than relying solely on the mechanistic calculation of a single 
mean effect estimate (Egger, Schneider, & Smith, 1998). We will discuss 
moderator analysis in more detail later. 

However, it is important to note that the methods used to estimate the 
amount of heterogeneity and conduct significance tests for heterogeneity 
are not always reliable, potentially leading to misleading interpretations of 
the variability of the true effect size. Relying solely on the Q-test is ill- 
advised due to its inadequate power to detect low heterogeneity (Chung, 
Rabe-Hesketh, & Choi, 2013; Riicker, Schwarzer, Carpenter, & Schumacher, 
2008). Furthermore, the rules of thumb benchmarks for I? only hold true 
when the within-study error is relatively constant (Borenstein, Higgins, 
Hedges, & Rothstein, 2017). Underestimating between-study heterogeneity 
or failing to detect any heterogeneity due to inadequate statistical power 
can result in authors fitting the wrong model (i.e., the fixed-effect model), 
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leading to inaccurate inferences about the overall effect (Higgins & Thomp- 
son, 2002; Thompson, 1994; Thompson & Sharp, 1999). 

Heterogeneity tests provide only a single piece of evidence when deciding 
between the fixed- and random-effects models. The choice of model should 
consider a range of factors, including the sampling frame, the desired type 
of inference, expectations about the distribution of the true effect, and the 
statistical significance of the heterogeneity tests, among others. Borenstein 
(2019) suggested that when studies in a meta-analysis are collected from 
the literature, a random-effects model is almost always preferable. This 
is because the true effect size is likely to vary across studies unless they 
were conducted by the same lab, following identical protocols, and using 
consistent materials on the same population. Furthermore, if we intend 
to make an inference to comparable populations, as is common in social 
sciences, the random-effects model becomes the only appropriate choice. 


5.4 Viewing results of the heterogeneity test and statistics in R 


To view the results of the heterogeneity test (Cochran’s Q) and the estimates of 
between-study variance (7T?) and I?, we still use the print() function: 


# Note, if you selected other transformation methods, 
# then type pes.logit or pes.da in print () 
print (pes) 


The confint() function computes and displays the confidence intervals for 7? 
and I?: 


# If you selected other transformation methods, 
# then type pes.logit or pes.da in confint() 
confint (pes) 


To display the output of heterogeneity-related results for the running exam- 
ple, we can type: 


print(pes.logit, digits = 4) 
confint(pes.logit, digits = 4) 


The output appears in Figure 2. It reveals that 7? is 0.3256 (95% CI = 0.3296, 
1.4997), I? is 97.24% (95% CI = 97.28, 99.39), and the Q-statistic is 580.5387 (p 
<.001), all of which suggests high heterogeneity in the observed proportions.” 


5 Again, the values of 7, T°, and J? have fallen out their 95% CIs. Readers can fix this 
problem by switching to the 99% CI. 
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Random-Effects Model (k = 17; tau*~2 estimator: DL) 


tau~2 (estimated amount of total heterogeneity): 0.3256 (SE 


= 0.2033) 
tau (square root of estimated tau^2 value): 0.5707 
I*2 (total heterogeneity / total variability): 97.24% 


H^2 (total variability / sampling variability): 36.28 


Test for Heterogeneity: 
Q(df = 16) = 580.5387, p-val < .0001 


Model Results: 


estimate se zval pval ci.lb ci.ub 
-7.7650 0.1502 -51.7147 <.0001 -8.0593 -7.4707 *** 


Signif. codes: 0O *** 0.001 ** 0.01 * 0.05 . 0.1 1 
estimate ci.lb ci.ub 

tau*2 0.3256 0.3296 1.4997 

tau 0.5707 0.5741 1.2246 

I72(%) 97.2439 97.2758 99.3884 

H*2 36.2837 36.7079 163.4972 


Figure 2. A random-effects model analysis of heterogeneity 


6 Visualization of heterogeneity 


This section is dedicated to visualization tools and a few formal diagnostic tests 
pivotal for heterogeneity analyses. We introduce two essential tools for readers: 
the forest plot and the Baujat plot. The forest plot allows for a visual assessment 
of the homogeneity across studies, while the Baujat plot can pinpoint studies 
that exert a significant impact on the overall effect, heterogeneity, or both. It’s 
crucial to introduce the forest plot at this point. It lays the foundation for our 
in-depth demonstration of its application in subgroup analyses, which we will 
discuss in Section 7. 


6.1 Forest plots 


A forest plot (as shown in Figure 3) is a graphical representation that effectively 
displays the point estimates of study effects along with their corresponding confi- 
dence intervals (Lewis & Clarke, 2001). It is composed of a vertical reference line, 
an x-axis, and graphical representations of effect size estimates and their 95% 
CIs. The x-axis of the forest plot represents the scale of the outcome measure 
(in our case, the proportion) and can range from 0 to 1. 
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Typically, the vertical reference line is positioned at the point estimate of the 
pooled proportion. At the bottom of the reference line lies a colored diamond 
shape with its length representing the 95% confidence interval of the pooled 
proportion. Each study effect plotted in a forest plot consists of two components: 
a colored square symbolizing the point estimate of the study effect size and a 
horizontal line through the square representing the confidence interval around 
the point estimate. I refer to the horizontal lines as the squares’ ” wings”, if you 
will. 


The size of a square corresponds to the study’s weight; a larger square signifies 
a larger sample size and, therefore, a greater weight. An effect size with a greater 
weight carries more influence on the summary effect size and is therefore depicted 
by a larger square with a shorter horizontal line (Anzures-Cabrera & Higgins, 
2010). 


In a forest plot, study effects are determined as homogeneous if all the hori- 
zontal lines of the squares overlap (Petrie, Bulman, & Osborn, 2003; Ried, 2006). 
The forest plot also allows us to identify potential outliers. This can be achieved 
by examining studies whose 95% confidence intervals do not overlap with the 
confidence interval of the summary effect size (Harrer, Cuijpers, A, & Ebert, 
2021). Furthermore, it is worth noting that if large studies are identified as out- 
liers, it may suggest that the overall heterogeneity is high. 
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Figure 3. An anatomy of a basic forest plot 
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6.2 Creating forest plots in R 


In this section, we will begin by explaining how to create a basic forest plot using 
the meta package. We will also show readers how to create a more sophisticated, 
publication-ready forest plot. 

We can create a simple forest plot using the following generic code (assuming 
that we have loaded the meta package): 


pes.summary <- metaprop(cases, total, authoryear, data 
= dat, sm = "PRAW") 
forest (pes. summary) 


Using the metaprop() function, we conduct a meta-analysis of proportions 
and save the results in an object named “pes.summary”. We then feed these 
results into the forest() function to automatically generate a forest plot. The sm 
argument in the metaprop() function dictates which transformation method will 
be used to convert the original proportions: 


PRAW # no transformation 
PLO # the logit transformation 
PFT # the double arcsine transformation 


Forest plots created by the generic code are bare-boned and often fail to meet 
publishing standards. The following code can produce publication-quality forest 
plots for the running example: 


pes.summary <- metaprop(cases, total, authoryear, data 
= dat, sm = "PLO", method.tau = "DL", method.ci = 
"NAsm") 
forest (pes.summary, 
common = FALSE, 
print.tau2 = TRUE, 
print.Q = TRUE, 
print.pval.Q = TRUE, 
print.I2 = TRUE, 
rightcols = FALSE, 
pooled.totals = FALSE, 


weight.study = "random", 

leftcols = c("studlab", "event", "n", "effect", 
MEd ye, 

leftlabs = c("Study", "Cases", "Total", 
"Prevalence", "95% C.1."), 

xlab = "Prevalence of CC (%)", 

smlab = "", 


xlim = c(0,4), 
pscale = 1000, 
squaresize = 0.5, 
fs.hetstat 10, 
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digits = 2, 

col.square = "navy", 
col.square.lines = "navy", 
col.diamond = "maroon", 
col.diamond.lines = "maroon" 


The generated forest plot is shown in Figure 4. 


Study Cases Total Prevalence 95% C.l. 

Stewart-Brown 1988 7 12853 0.54 [0.26; 1.14] —#— 

SanGiovanni 2002 73 53639 1.36 [1.08; 1.71] E 

Stayte 1993 4 6687 0.60 [0.22;1.59] —#——— 

Haargaard 2004 773 2616439 0.30 [0.28;0.32] E: 

Rahi 2001 248 734000 0.34 [0.30; 0.38] 

Wirth 2002 421 1870000 0.23 [0.20; 0.25] @ 

Hu 1987 77 207319 0.37 [0.30; 0.46] @ 

Bermejo 1998 71 1124654 0.06 [0.05; 0.08] E 

Stoll 1997 57 212479 0.27 [0.21; 0.35] 

Abrahamsson 1999 136 377334 0.36 [0.30; 0.43] 

Bhatti 2003 199 982128 0.20 [0.18; 0.23] 

Halilbasic 2014 51 38133 1.34 [1.02; 1.76] -E 

Chen 2014 6 9246 0.65 [0.29;1.44] —#—— 

Yang 2014 8 6299 1.27 [0.64; 2.54] —_—17?_ 

Pi 2012 3 3079 0.97 [0.31; 3.02] ——#—_—_—__—__ 

Holmes 2003 10 33021 0.30 [0.16; 0.56] Œ- 

Nie 2008 15 15398 0.97 [0.59; 1.62] : —=— 

Random effects model š 0.42 [0.32; 0.57] = 

Heterogeneity: 1° = 97%, 1? = 0.3256, x5, = 580.54 (p < 0.01) l l I I l 
0 1 2 3 4 


Prevalence of CC (%) 


Figure 4. A publication-quality forest plot 


The arguments in forest() provided above are mostly self-explanatory. They 
determine which components of the forest plot are displayed, as well as their 
colors, sizes, and positions on the graph. The pscale argument is particularly 
noteworthy. Setting “pscale = 1000” means that the prevalence is expressed as 
events per 1,000 observations. Consequently, the combined proportion under the 
random-effects model is displayed as 0.42% in the forest plot®. It should be 
mentioned that due to space constraints, we have only listed the most essen- 
tial arguments in the forest() function. Readers are encouraged to refer to the 
documentation that comes with the meta package (type ?meta::forest() in R) to 
explore additional useful arguments for customizing their own forest plots. 


ë Readers should note that showing the permille symbol (%o) within code snippets 
in ATFX can be challenging. Consequently, the “%” is used in the zlab argument 
purely for illustrative purposes. For accurate representation, readers can substitute 
the “%” with “%o” in R. 
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We can sort the individual studies by precision to help us visually inspect 
the data. This can be achieved by sorting the included studies using SE or the 
inverse of SE: 


precision <- sqrt(ies.logit$vi) 


We then add the sortvar argument in the forest() function: 


sortvar = precision 


The new forest plot is shown in Figure 5. This forest plot clearly shows that 
the prevalence of CC is higher in smaller studies (those with longer “wings” ). 
In meta-analyses of comparative studies, a forest plot without indications of 
publication bias will exhibit an even spread of studies with varying precision on 
both sides of the mean effect size. However, in a meta-analysis of observational 
data, an uneven spread of studies may actually reflect a genuine pattern in effect 
sizes rather than publication bias, especially when small studies fall to the right 
side of the mean. It is also possible that some small studies are not published 
due to valid reasons, such as the use of inadequate research methods. Thus, this 
uneven distribution of effects warrants further investigation as it may provide 
new insights into the topic of interest. 


Study Cases Total Prevalence 95% C.l. 

Haargaard 2004 773 2616439 0.30 [0.28; 0.32] 

Wirth 2002 421 1870000 0.23 [0.20; 0.25] 

Rahi 2001 248 734000 0.34 [0.30; 0.38] 

Bhatti 2003 199 982128 0.20 [0.18; 0.23] 

Abrahamsson 1999 136 377334 0.36 [0.30; 0.43] 

Hu 1987 77 207319 0.37 [0.30; 0.46] Æ 

SanGiovanni 2002 73 53639 1.36 [1.08; 1.71] E 
Bermejo 1998 71 1124654 0.06 [0.05; 0.08] E 

Stoll 1997 57 212479 0.27 [0.21; 0.35] 

Halilbasic 2014 51 38133 1.34 [1.02; 1.76] = 

Nie 2008 15 15398 0.97 [0.59; 1.62] : —— 
Holmes 2003 10 33021 0.30 [0.16; 0.56] Œ- 

Yang 2014 8 6299 1.27 [0.64; 2.54] : —E 
Stewart-Brown 1988 7 12853 0.54 [0.26; 1.14] —#— 

Chen 2014 6 9246 0.65 [0.29; 1.44] —=—— 

Stayte 1993 4 6687 0.60 [0.22; 1.59) —=—— 

Pi 2012 3 3079 0.97 [0.31; 3.02] +———=———————— 
Random effects model 0.42 [0.32; 0.57] + 


Heterogeneity: 1? = 97%, 1? = 0.3256, x?, = 580.54 (p < 0.01) I l T j l 
0 1 2 3 4 
Prevalence of CC (%o) 


Figure 5. A forest plot with sorted studies by precision 


A visual inspection of the forest plot identifies several potential outlying 
studies, including Wirth (2002), Bhatti (2003), SanGiovanni (2002), Bermejo 
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(1998), Halilbasic (2014), Nie (2008), and Yang (2014). Their 95% CIs do not 
overlap with that of the summary proportion. In the next step, we will cross- 
validate these potential outliers using the Baujat plot. 


6.3 Identifying outlying and influential studies with diagnostic tools 


When dealing with high between-study heterogeneity in a meta-analysis, one 
approach is to identify and exclude outliers, and then reassess the robustness of 
the summary effect size. In this section, we will introduce some diagnostic tools 
that can identify outlying and influential studies. 

A basic Baujat plot is depicted in Figure 6. The horizontal axis of the Bau- 
jat plot quantifies each study’s contribution to the overall heterogeneity or the 
Cochran Q-test, while the vertical axis measures the impact of each study on 
the summary effect size. We’ve divided the Baujat plot into four quadrants with 
light blue dotted lines for illustration purposes. Studies situated far to the right 
on the horizontal axis (in Quadrants 2 and 3) are significant contributors to 
heterogeneity. Those positioned far up on the vertical axis (in Quadrants 1 and 
2) substantially influence the overall meta-analysis result. A study’s influence 
is deemed substantial if its removal would lead to a drastically different overall 
effect. 
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Figure 6. An anatomy of a basic Baujat plot 
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It can sometimes be challenging to differentiate between the concepts of an 
“outlier” and an “influential effect size” in the context of meta-analysis. While 
an outlying effect size can often be influential, it isn’t always so. Conversely, an 
effect size that is influential doesn’t necessarily have to be an outlier (Harrer et 
al., 2021). The Baujat plot helps distinguish between outliers that are influential 
and those that are not: 


Small studies with effect sizes similar to others typically fall into the lower 
left corner of Quadrant 4, indicating they are neither outliers nor influential. 
— Small studies with notably different effect sizes than others often appear in 
the lower right corner of Quadrant 3. They may be outliers, but their small 
sample sizes prevent them from heavily impacting the overall effect size. 
Large studies with effect sizes similar to the majority of effect sizes tend 
to populate the upper left corner of Quadrant 1. While these studies have 
influential effects, they may not be outliers. Their influence on the pooled 
effect size is pronounced because of their extensive sample sizes. 

Large studies with dramatically different effect sizes than the rest tend to 
appear in the upper right corner of Quadrant 2. These studies are influential 
outliers, exerting the most substantial impact on both the overall effect and 
heterogeneity. 


It is crucial to conduct several formal diagnostic tests to determine if the out- 
lying effect sizes identified in the forest plot and Baujat plot are truly outliers. If 
deemed outliers, further investigation is required to determine their actual influ- 
ence on the overall effect size. Viechtbauer and Cheung (2010) have proposed a 
set of case deletion diagnostics derived from linear regression analyses to identify 
influential studies, such as difference in fits values (DFFITS), Cook’s distances, 
leave-one-out estimates for the amount of heterogeneity (i.e., T?) as well as the 
test statistic for heterogeneity (i-e., Q-statistic). In leave-one-out analyses, each 
study is removed sequentially, and the summary proportion is re-estimated based 
on the remaining n-1 studies. This approach allows for the assessment of each 
study’s influence on the summary proportion. 

Outlying effect sizes can also be identified by screening for externally stu- 
dentized residuals exceeding an absolute value of 2 or 3 (Tabachnick, Fidell, & 
Osterlind, 2013; Viechtbauer & Cheung, 2010). 

As a final note, instead of simply removing outlying effect sizes, meta-analysts 
should investigate these outliers and influential cases to understand their occur- 
rence. They sometimes reveal valuable study characteristics that may serve as 
potential moderating variables. 


6.4 Identifying outlying and influential studies in R 


In this section, we will use the Baujat plot and diagnostic tests introduced above 
to detect outliers and influential studies. The generic code for Baujat plot is 
provided below: 


baujat(pes) # or pes.logit, pes.da 
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For the running example, use the following code to create a customized Bau- 
jat plot: 


# Create a Baujat plot 
bjplot <- baujat(pes.logit, 
symbol=19, 
xlim=c(0,15), 
xlab="Contribution to Overall 
Heterogeneity", 
ylab="Influence on Summary 
Proportion") 
# Label those studies located in the upper quadrants 
bjplot <- bjplot[bjplot$x >= 10 | bjplot$y >= 0.4,] 
text (bjplot$x, bjplot$y, bjplot$slab, pos=1) 


The generated plot can be seen in Figure 7. In this customized Baujat plot, we 
have labeled only a few of the more “extreme” studies, specifically: SanGiovanni 
(2002) (Study 2), Bermejo (1998) (Study 8), and Halilbasic (2014) (Study12). We 
observe that both Study 2 and Study 12 may be considered influential, though 
they might not contribute heavily to the overall heterogeneity. In contrast, Study 
8 stands out as an influential outlier, as it has a large impact on both the pooled 
proportion and heterogeneity. 
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Figure 7. A basic Baujat plot 
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Next, we screen for large externally studentized residuals (ESR). The code 
below calculates the ESR for each study in the current dataset, then sorts them 
in descending order based on the absolute values of the z-scores tied to their 
respective ESRs: 


# Calculate ESR 

stud.res <- rstudent(pes.logit) # or pes, pes.da 
# Sort ESR by z-values in descending order 

abs.z <- abs(stud.res$z) 

stud.res[order(-abs.z)] 


The test outcome appears in Figure 8. The key here is to locate studies with 
z-values that exceed an absolute value of 2 or 3. Since we only have 17 studies 
in the running example, we will set the threshold at 2. Therefore, the second, 
eighth, and twelfth studies are chosen. They match the studies we previously 
identified through the Baujat plot. 


resid se z 
8 -2.0265 0.5183 -3.9101 
2 1.2701 0.5183 2.4505 
12 1.2415 0.5541 2.2407 
14 1.1563 0.6831 1.6928 
17 0.8840 0.6382 1.3853 
11 -0.7967 0.6198 -1.2854 
6 -0.6895 0.6576 -1.0485 
15 0.8618 0.8254 1.0441 
9 -0.4925 0.6177 -0.7973 
13 0.4459 0.7182 0.6209 
4 -0.4063 0.7250 -0.5604 
16 -0.3563 0.6727 -0.5297 
3 0.3579 0.7743 0.4622 
5 -0.2520 0.6444 -0.3911 
1 0.2627 0.7021 0.3741 
10 -0.1790 0.6231 -0.2872 
7 -0.1447 0.6162 -0.2348 


Figure 8. Externally studentized residuals results 


The following code performs a set of leave-one-out diagnostic tests: 


# Option 1: no transformation 

# L10 stands for leave-one-out 

L10 <- leavetout(pes); print(L10) 

# Option 2: the logit transformation 

L10 <- leavetout(pes.logit, transf = transf.ilogit) 
print (L10) 
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# Option 3: the double arcsine transformation 

# targ can also be set to list(ni = 1/(pes.da$se) ~2) 

L10 <- leavetout(pes.da, transf = transf.ipft.hm, targ 
= list(ni = dat$total)) 

print (L10) 


Using the current data set, we execute the following code: 


L10 <- leavetout(pes.logit, transf = transf.ilogit) 
print(L10, digits = 6) 


The output is shown in Figure 9. The numbers in the first column are the 
leave-one-out estimates for the summary proportion, which are derived by ex- 
cluding one study at a time from the included studies. For instance, the first 
estimate in this column (i.e., 0.000419) is the summary proportion estimate 
when the first study in the included studies is removed. 


estimate zval pval ci.lb ci.ub q Qp tau2 12 H2 
1 0.000419 -50.492057 0.000000 0.000310 0.000566 577.938615 0.000000 0.326294 97.404569 38.529241 
2 0.000383 -58.124097 0.000000 0.000293 0.000499 405.001830 0.000000 0.236593 96.296313 27.000122 
3 0.000418 -50.760057 0.000000 0.000310 0.000565 578.562279 0.000000 0.325980 97.407366 38.570819 
4 0.000443 -41.417189 0.000000 0.000308 0.000639 580.526132 0.000000 0.489631 97.416137 38.701742 
5 0.000435 -46.319695 0.000000 0.000313 0.000603 575.730710 0.000000 0.383340 97.394615 38.382047 
6 0.000449 -45.217145 0.000000 0.000321 0.000626 540.974670 0.000000 0.400959 97.227227 36.064978 
7 0.000429 -48.854385 0.000000 0.000315 0.000586 576.473491 0.000000 0.341576 97.397972 38.431566 
8 0.000479 -56.505027 0.000000 0.000367 0.000624 404.914535 0.000000 0.236229 96.295515 26.994302 
9 0.000439 -48.899481 0.000000 0.000322 0.000598 579.956198 0.000000 0.338978 97.413598 38.663747 
10 0.000431 -47.992385 0.000000 0.000314 0.000591 574.985048 0.000000 0.354815 97.391237 38.332337 
11 0.000449 -47.824077 0.000000 0.000328 0.000616 548.816035 0.000000 0.353117 97.266844 36.587736 
12 0.000387 -55.147300 0.000000 0.000293 0.000511 461.941616 0.000000 0.267048 96.752836 30.796108 
13 0.000416 -50.664580 0.000000 0.000308 0.000562 576.843109 0.000000 0.325434 97.399640 38.456207 
14 0.000400 -51.312960 0.000000 0.000297 0.000539 563.535902 0.000000 0.318164 97.338235 37.569060 
15 0.000412 -51.101371 0.000000 0.000305 0.000555 576.283345 0.000000 0.324438 97.397114 38.418890 
16 0.000432 -50.008941 0.000000 0.000319 0.000586 580.534075 0.000000 0.328479 97.416172 38.702272 
17 0.000403 -51.136341 0.000000 0.000298 0.000543 559.149838 0.000000 0.317149 97.317356 37.276656 


Figure 9. Results of leave-one-out diagnostic meta-analyses 


A leave-one-out forest plot can visualize the change in the summary effect 
size. The generic code is given below: 


# Option 1: no transformation 

lio <- leavelout (pes) 

yi <- llo$estimate; vi <- llo$se2 
forest (yi, 


vi, 

slab = paste(dat$author, dat$year, sep = ","), 
refline = pes$b, 

xlab = "Leave-one-out summary proportions") 


# Option 2: the logit transformation 
lio <- leavelout (pes.logit) 
yi <- lilo$estimate; vi <- lilo$se2 
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forest (yi, 
vi, 
transf = transf.ilogit, 
slab = paste(dat$author, dat$year, sep = ","), 
refline = pes$pred, 
xlab = "Leave-one-out summary proportions") 


# Option 3: the double arcsine transformation 
# targ can also be set to list(ni = 1/(pes.da$se) “~2) 
lio <- leavelout (pes.da) 
yi <- lilo$estimate; vi <- llo$se2 
forest (yi, 
vi, 
transf = transf.ipft.hn, 
targ = list(ni = dat$total), 


slab = paste(dat$author, dat$year, sep = ","), 
refline = pes$pred, 
xlab = "Leave-one-out summary proportions") 


To generate a customized leave-one-out forest plot for the current data set, 
use the following code: 


lio=leavelout (pes.logit) 

yi=lio$estimate; vi=l1o$se^2 

forest (yi, 
vi, 
transf=transf.ilogit, 
slab=paste(dat$author ,dat$year,sep=", "), 
xlab="Leave-one-out summary proportions", 
refline=pes$pred, 
digits=6) 

abline (h=0.1) 


The generated forest plot is shown in Figure 10. Each black square repre- 
sents a leave-one-out summary proportion. The reference line indicates where 
the original summary proportion lies. The further a box deviates from the refer- 
ence line, the more pronounced the impact of the corresponding excluded study 
will be on the original summary proportion. For instance, if we exclude the study 
by SanGiovanni et al. (2002), the new summary proportion becomes 0.00038. If 
we exclude Stayte et al. (1993), the new summary proportion becomes 0.000418. 
Apparently, excluding the former study has a larger impact on the original sum- 
mary proportion than the latter study. 
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Stewart-Brown, 1988 = 0.000419 [0.000310, 0.000566 
SanGiovanni, 2002 — ue 0.000383 [0.000293, 0.000499 
Stayte, 1993 = 0.000418 [0.000310, 0.000565. 
Haargaard, 2004 . 0.000443 [0.000308, 0.000639 
Rahi, 2001 = 0.000435 [0.000313, 0.000603 
Wirth, 2002 = 0.000449 [0.000321, 0.000626 
Hu, 1987 s 0.000429 [0.000315, 0.000586 
Bermejo, 1998 E 0.000479 [0.000367, 0.000624 
Stoll, 1997 2 0.000439 [0.000322, 0.000598 
Abrahamsson, 1999 2 0.000431 [0.000314, 0.000591 
Bhatti, 2003 = 0.000449 [0.000328, 0.000616 
Halilbasic, 2014 = 0.000387 [0.000293, 0.000511 
Chen, 2014 z 0.000416 [0.000308, 0.000562 
Yang, 2014 = 0.000400 [0.000297, 0.000539 
Pi, 2012 = 0.000412 [0.000305, 0.000555. 
Holmes, 2003 = 0.000432 [0.000319, 0.000586 
Nie, 2008 = 0.000403 [0.000298, 0.000543 
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Figure 10. A leave-one-out forest plot 


With these potential influential studies in mind, we now conduct a few more 
leave-one-out diagnostics with the influence() function in metafor to verify our 
guesses: 


inf <- influence (pes.logit) 
print (inf, digits=3) 
plot (inf) 


In Figure 11, studies marked with an asterisk are potential influential studies: 


rstudent dffits cook.d cov.r tau2.del QE.del hat weight dfbs inf 


1 0.374 0.083 0.007 1.052 0.326 577.939 0.048 4.811 0.083 
2 2.451 0.801 0.474 0.813 0.237 405.002 0.066 6.643 0.791 * 
3 0.462 0.093 0.009 1.042 0.326 578.562 0.039 3.915 0.093 
4 -0.560 -0.242 0.088 1.541 0.490 580.526 0.069 6.896 -0.247 
5 -0.391 -0.151 0.027 1.239 0.383 575.731 0.068 6.839 -0.152 
6 -1.049 -0.336 0.139 1.289 0.401 540.975 0.069 6.873 -0.339 
7 -0.235 -0.077 0.006 1.117 0.342 576.473 0.067 6.658 -0.077 
8 -3.910 -0.941 0.653 0.812 0.236 404.915 0.066 6.636 -0.929 * 
9 -0.797 -0.223 0.052 1.109 0.339 579.956 0.066 6.569 -0.224 
10 -0.287 -0.103 0.011 1.156 0.355 574.985 0.068 6.770 -0.103 
1r -1.285 -0.369 0.148 1.152 0.353 548.816 0.068 6.818 -0.371 
12 2.241 0.674 0.377 0.900 0.267 461.942 0.065 6.530 0.669 
13 0.621 0.136 0.019 1.047 0.325 576.843 0.046 4.578 0.136 
14 1.693 0.395 0.153 1.031 0.318 563.536 0.050 5.001 0.395 
15 1.044 0.197 0.039 1.032 0.324 576.283 0.034 3.420 0.198 
16 -0.530 -0.128 0.017 1.064 0.328 580.534 0.053 5.296 -0.128 
17 1.385 0.350 0.120 1.036 0.317 559.150 0.057 5.746 0.350 


Figure 11. Results of the influential study test 
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The diagnostics plots in Figure 12 show that the second and eighth studies 
are colored in red, indicating that they fulfill the criteria as influential studies. 


rstudent dffits 
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Figure 12. Influential study diagnostics 


Based on the Baujat plot and the outcomes of the diagnostic tests, we deter- 
mine that all three studies (Study 2, 8, and 12) can be considered outliers, but 
only Study 2 and 8 are deemed influential. 


6.5 Removing outlying studies in R 


Once all possible outliers are identified, we can remove them with the following 
generic code: 
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# Depending on the transformation method, 

# measure = "PLO" or measure = "PFT" 

# Remember to add "add = 0" when using the 

# double arcsine transformation 

ies.noutlier <- escalc(xi = cases, ni = total, measure 
= "PR", data = dat[-c(studyl, study2,),]) 


If we were to exclude Study 2 and Study 8 in the current data set, we would 
execute the following code: 


# Remove the two studies and calculate individual 

# effect sizes 

ies.logit.noutlier <- escalc(xi = cases, ni = total, 
measure = "PLO", data = dat[-c(2, 8),]) 

# Conduct meta-analysis with no outliers 

pes.logit.noutlier <- rma(yi, vi, data = 
ies.logit.noutlier, method = "DL") 

pes.noutlier <- predict(pes.logit.noutlier, transf = 
transf.ilogit) 

print(pes.noutlier, digits = 5) 


7 Explanation of heterogeneity with moderator analyses 


We’ve determined that our data shows significant heterogeneity. Furthermore, 
we identified several outlying studies that notably impact both the overall effect 
and the variability of the observed effect sizes. When substantial heterogeneity 
remains even after excluding these outliers, one commonly employed strategy 
to unearth additional sources of heterogeneity is through moderator analyses. 
In fact, a thorough moderator analysis can often yield deeper insights than a 
mere estimate of summary effect size. This analysis helps identify and quantify 
the extent to which certain study-level characteristics contribute to the observed 
heterogeneity. 

Subgroup analysis and meta-regression are two major forms of moderator 
analysis. Subgroup analysis can be seen as a special case of meta-regression, 
which examines the impact of a single categorical variable (Thompson & Higgins, 
2002). In fact, meta-regression can accommodate both categorical and continu- 
ous moderators of desired numbers. For instance, a meta-regression can include 
a series of continuous variables or a mix of both continuous and categorical vari- 
ables. In this tutorial, our focus will be on subgroup analysis and meta-regression 
with a continuous moderator. 


7.1 Meta-regression with a categorical moderator: Subgroup 
analysis 


When we want to explain heterogeneity with a categorical moderator in a meta- 
analysis, subgroup analysis is the method of choice. This approach mirrors the 
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logic of ANOVA in primary research (Littell, Corcoran, & Pillai, 2008). In a sub- 
group analysis, studies are partitioned into two or more subgroups according to 
the categories within the moderator. This moderator represents a specific study 
characteristic that can potentially explain a portion of the variability observed 
between studies (Hamza et al., 2008). If a subgroup has a unique characteristic 
absent in other subgroups (e.g., exposure to a new treatment vs. an old treat- 
ment), and the effect sizes between the subgroups show significant differences, 
it suggests that the variation in effect sizes (i.e., the true heterogeneity) can 
be attributed to this unique characteristic. In essence, the purpose of subgroup 
analysis is to ascertain if the chosen moderator accounts for a significant portion 
of the true heterogeneity. 

To evaluate the influence of a proposed moderator, we apply a weighted least 
squares (WLS) regression. In this approach, effect sizes (e.g., those transformed 
using logit or double arcsine methods) are regressed against the moderator (Har- 
rer et al., 2021): 


ES; = Bo + (iC + 6; + ei (17) 


where ES; is the observed effect size for the primary study i, C is the dummy 
variable representing the moderator (or predictor), 31 is the regression coefficient 
(or slope), and 6o is the model intercept. ô; and e; are error terms. Specifically, 
6; is the between-study error for the primary study i, with its variance being 
the between-study variance, 77; e; is the sampling error for the primary study i, 
with its variance being the within-study variance. The goal of the meta-regression 
model is to estimate the parameters, 89 and £81. 

The categorical moderator is introduced in the analysis through dummy cod- 
ing (e.g., the “studesg” variable in our data set). Let’s say we have two categories 
within this predictor: Subgroup A and Subgroup B. If Subgroup A is chosen as 
the reference group, then all primary studies in Subgroup A would be coded as 
0, while those in Subgroup B would be coded as 1. Mathematically, this can be 
represented as C = 0 for Subgroup A and C = 1 for Subgroup B. The regression 
coefficient of C, 81, quantifies the effect size difference between the two sub- 
groups. When C = 0, Bo becomes the true overall effect of Subgroup A. When 
C = 1, the overall effect of Subgroup B is captured by the sum ĝo and 81. In 
summary, the observed effect size for the study i, E S;, is an estimator of the 
study’s true effect size, 8o + 81C + 6;, burdened by the sampling error, e;. 

Eq. (17) is a mixed-effects meta-regression model, a standard choice for meta- 
regression. In subgroup analyses, this model combines the study effects within 
each subgroup using a random-effects model, while a fixed-effect model is used 
to combine subgroups and yield the overall effect (Borenstein et al., 2009). A 
Wald-type test is used in meta-regression to determine if the slope of the model 
is statistically significant, using the Z-score. In subgroup analyses, a statistically 
significant slope suggests that Subgroups A and B exhibit statistically signifi- 
cant differences between their overall effect sizes. In other words, the subgroup 
membership can explain some or all of the between-study heterogeneity. Another 
method to assess a moderator’s impact in meta-regression is through Cochran’s 
Q. In subgroup analyses, if the Q-statistic for the predictor is statistically sig- 
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nificant, it means that the subgroup membership explains some or the entirety 
of the variability observed in the effect sizes. The R? index can be employed 
in meta-regression to quantify the proportion of the true heterogeneity across 
all studies (i.e., the between-study heterogeneity) that can be accounted for by 
moderators. 


7.2  Meta-regression with a continuous moderator 


In a meta-regression model with a single continuous moderator, as shown in Eq. 
(18) (Harrer et al., 2021), 


ES; = bo + fia + Ôi + ei (18) 


x; represent a continuous moderator, 3; is the regression slope. 6; and e; are the 
between- and within-study error terms for the study i, respectively. 6o is still 
the model intercept, but it now represents the overall true effect size when z = 
0. In summary, E'S; represents the observed effect size for the study i, which is 
an estimator of the study’s true effect size, Bo + 8,2”; + ĝi, burdened by the 
sampling error, ei. 

As summarized by Harrer et al. (2021), meta-regression analyzes the rela- 
tionship between predictors and observed effects to identify a consistent pattern 
between them, in the form of a regression line. By accounting for both sam- 
pling error and between-study differences, meta-regression seeks to fit a model 
that can generalize across all possible studies relevant to the topic. A well-fitting 
meta-regression model can predict effect sizes close to the observed data. 


An important caveat 

Moderator analysis is subject to several limitations that should be taken 
into consideration. A primary issue is that both the subgroup analysis and 
meta-regression require a large ratio of studies to moderators. It is generally 
recommended that moderator analysis should only be conducted when there 
are at least 10 studies available for each moderator included in the analysis. 
This is particularly crucial in multivariate models where the number of 
studies might be small, leading to reduced statistical power (Higgins & 
Green, 2006; Littell et al., 2008). 

Another significant limitation is that the significant differences observed 
between subgroups of studies cannot be seen as causal evidence. We may 
fail to identify moderators that are truly responsible for the heterogene- 
ity in effect sizes. Consequently, causal conclusions cannot be drawn solely 
from moderator analyses (Cuijpers, 2016; Littell et al., 2008). We strongly 
recommend that researchers select moderators based on solid theoretical 
reasoning and only test those moderators with a strong theoretical basis. 
This approach helps prevent erroneously attributing heterogeneity to spu- 
rious moderators (Schmidt & Hunter, 2014). 
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7.3 Conducting subgroup analyses and recalculating the overall 
summary proportion in R 


In a mixed-effects model meta-regression, the summary effect size for each sub- 
group is computed using a random-effects model. Instead of estimating T? across 
all studies, it’s estimated within these subgroups. In other words, each subgroup 
has its own estimated 7?. These 7? estimates may vary across subgroups. We 
can choose to pool them or keep them separate when we compute the over- 
all and within-subgroup summary proportions, depending on our assumptions 
(Borenstein et al., 2009). 

If we attribute the differences in these observed within-group T^ estimates 
solely to sampling error, then we anticipate a common 7? across subgroups. In 
such a scenario, pooling a common 7? estimate and applying it universally to all 
studies is appropriate. Conversely, if systematic factors, beyond just sampling 
errors, are believed to influence the varying values of the observed within-group 
T? estimates, then employing distinct 7? estimates for each subgroup is justi- 
fied. Essentially, using a separate estimate for between-study variance is equal 
to conducting an independent meta-analysis for each subgroup. It’s important 
to emphasize that the pooled proportion across all subgroups is likely to differ 
from the summary proportion derived from pooling across all studies without 
subgrouping. Nevertheless, any differences in these estimates are generally neg- 
ligible. 

When we assume that 7? is the same for all subgroups, we can use the R? 
index to represent the proportion of the between-study variance across all studies 
that can be explained by the subgroup membership (Borenstein et al., 2009). 

We have developed the following generic code to help readers perform sub- 
group analyses and compute the overall and within-subgroup summary propor- 
tions. It is essential for readers to gain a thorough understanding of their data’s 
characteristics to choose the appropriate computational option. 

In the first situation, we do not assume a common between-study variance 
component across subgroups and thus do not pool within-group 7? estimates. In 
R, we first fit a random-effects model for each subgroup, and then we combine the 
estimated statistics into a data frame. In the next step, we fit a fixed-effect model 
to compare the two estimated logit transformed proportions and recalculate the 
summary proportion. The generic code is provided below: 


2 


2 


# Assumption 1: 

# Do not assume a common between-study variance 

# component (not pooling within-group estimates of 

# between-study variance) 

# Option 1: no transformation 

# Conduct a random-effects model meta-analsis for each 

# subgroup defined by the moderator variable 

pes.subgroup1 <- rma(yi, vi, data = ies, subset = 
moderator == "subgroupi") 

pes.subgroup2 <- rma(yi, vi, data = ies, subset = 


moderator == "subgroup2") 


98 N. Wang 


Create a dataframe to store effect size estimates, 

standard errors, heterogeneity for both subgroups 

Add an object named moderator to distinguish two 

subgroups. It will be used in the next step. 

dat.diffvar <- data.frame(estimate = 
c(pes.subgroupi$b, pes.subgroup2$b), stderror = 
c(pes.subgroupl$se, pes.subgroup2$se), moderator = 
c("subgroupi", "subgroup2"), tau2 = 
round(c(pes.subgroupi$tau2, pes.subgroup2$tau2) , 
3)) 

# Fit a fixed-effect meta-regression to compare the 

# subgroups 

subganal.moderator <- rma(estimate, sei = stderror, 
mods = ~ moderator, method = "FE", data = 
dat.diffvar) 

# Recalculate summary effect size assuming different 


# 
# 
# 
# 


# heterogeneity components 

pes.moderator <- rma(estimate, sei = stderror, method 
= "FE", data = dat.diffvar) 

pes.moderator <- predict (pes.moderator) 

# Display subgroup 1 summary effect size 

print (pes. subgroup!) 

# Display subgroup 2 summary effect size 

print (pes. subgroup2) 

# Display subgroup analysis results 

print (subganal.moderator) 

# Display recomputed summary effect size 

print (pes.moderator) 


# Option 2: the logit transformation 

# Conduct a random-effects model meta-analsis for each 
# subgroup defined by the moderator variable 
pes.logit.subgroup1 <- rma(yi, vi, data = ies.logit, 


subset = moderator == "subgroupi") 
pes.logit.subgroup2 <- rma(yi, vi, data = ies.logit, 
subset = moderator == "subgroup2") 


pes.subgroup1 <- predict(pes.logit.subgroupi!, transf 
= transf.ilogit) 

pes.subgroup2 <- predict(pes.logit.subgroup2, transf 
= transf.ilogit) 

Create a dataframe to store effect size estimates, 

standard errors, heterogeneity for both subgroups 

Add an object named moderator to distinguish two 

subgroups. 


# HHH 
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dat.diffvar <- data.frame(estimate = 
c(pes.logit.subgroupl$b, pes.logit.subgroup2$b) , 
stderror = c(pes.logit.subgroupl$se, 
pes.logit.subgroup2$se), moderator = 
c("subgroupi", "subgroup2"), tau2 = 
round(c(pes.logit.subgroupi$tau2, 
pes. logit.subgroup2$tau2), 3)) 

# Fit a fixed-effect meta-regression to compare the 

# subgroups 

subganal.moderator <- rma(estimate, sei = stderror, 
mods = ~ moderator, method = "FE", data = 
dat.diffvar) 

# Recalculate summary effect size assuming different 

# heterogeneity components 

pes.logit.moderator <- rma(estimate, sei = stderror, 
method = "FE", data = dat.diffvar) 

pes.moderator <- predict(pes.logit.moderator, transf = 
transf.ilogit) 

# Display subgroup 1 summary effect size 

print (pes.subgroup1); print(pes.logit.subgroup1) 

# Display subgroup 2 summary effect size 

print (pes.subgroup2); print (pes.logit.subgroup2) 

# Display subgroup analysis results 

print (subganal.moderator) 

# Display recomputed summary effect size 

print (pes.moderator) 


# Option 3: the double arcsine transformation 
# Conduct a random-effects model meta-analsis for each 
# subgroup defined by the moderator variable 

# targ can also be set to list(ni = 1/(pes.da$se)~2) 

P 


es.da.subgroupl <- rma(yi,vi,data = ies.da, subset = 
moderator == "subgroupi") 

pes.da.subgroup2 <- rma(yi,vi,data = ies.da, subset = 
moderator == "subgroup2") 


pes.subgroup1 <- predict(pes.da.subgroup1, transf 
transf.ipft.hm,targ = list(ni = dat$total)) 
pes.subgroup2 <- predict(pes.da.subgroup2, transf 
transf.ipft.hm,targ = list(ni = dat$total)) 
Create a dataframe to store effect size estimates, 
standard errors, heterogeneity for both subgroups 
Add an object named moderator to distinguish two 
subgroups. 
dat.diffvar <- data.frame(estimate = 
c(pes.da.subgroupl$b, pes.da.subgroup2$b) , 


# 
# 
# 
# 
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stderror = c(pes.da.subgroupl$se, 
pes.da.subgroup2$se), moderator = c("subgroupi", 
"subgroup2"), tau2 = 
round(c(pes.da.subgroupi$tau2, 
pes.da.subgroup2$tau2), 3)) 

# Fit a fixed-effect meta-regression to compare the 

# subgroups 

subganal.moderator <- rma(estimate, sei = stderror, 
mods = ~ moderator, method = "FE", data = 
dat.diffvar) 

# Recalculate summary effect size assuming different 

# heterogeneity components 

# targ can also be set to list(ni = 1/(pes.da$se)~2) 

pes.da.moderator <- rma(estimate, sei = stderror, 
method = "FE", data = dat.diffvar) 

pes.moderator <- predict(pes.da.moderator, transf = 
transf.ipft.hm, targ = list(ni = dat$total)) 

# Display subgroup 1 summary effect size 

print (pes.subgroup1); print (pes .da.subgroup1) 

# Display subgroup 2 summary effect size 

print (pes.subgroup2); print (pes.da.subgroup2) 

# Display subgroup analysis results 

print (subganal.moderator) 

# Display recomputed summary effect size 

print (pes.moderator) 


In the second situation, we assume a common between-study variance compo- 
nent across subgroups and pool within-group 7? estimates. Generally speaking, 
unless there is a substantial number of studies available within each subgroup 
(i.e., more than five studies) or compelling evidence suggesting within-group 
variances vary from one subgroup to the next, it is sufficient to calculate sum- 
mary proportions and create forest plots with a pooled 7? (Borenstein et al. 
(2009)). In this case, we can directly use the rma() function and fit a mixed- 
effects model to evaluate the potential moderator. In R, we still need to combine 
the estimated statistics into a new data frame for us to calculate a new overall 


summary proportion using a pooled 7? across all studies. 


Assumption 2: Assume a common between-study variance 


# 

# component (pool within-group estimates of 

# between-study variance) 

# Option 1: no transformation 

# Conduct moderator analysis 

subganal.moderator <- rma(yi, vi, data = ies, mods = 
moderator) 

pes.subg.moderator <- predict (subganal.moderator) 

# Obtain estimates for each subgroup 
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pes.subgroup1i <- rma(yi, vi, data = ies, mods = ~ 
moderator == "subgroup2") 

pes.subgroup2 <- rma(yi, vi, data = ies, mods = ~ 
moderator == "subgroupi") 


# Create a dataframe to store effect size estimates, 

# standard errors, heterogeneity for both subgroups 

dat.samevar <- data.frame(estimate = 
c((pes.subgroupi$b) [1], (pes.subgroupi$b) [1]), 
stderror = c((pes.subgroup2$se) [1], 
(pes.subgroup2$se) [1]), tau2 = 
subganal.moderator$tau2) 

# Recalculate summary effect size assuming a common 

# heterogeneity component 

pes.moderator <- rma(estimate, sei = stderror, method 
= "FE", data = dat.samevar) 

pes.moderator <- predict (pes.moderator) 

# Display subgroup 1 summary effect size 

print (pes.subg.moderator[study label 1]) 

# Display subgroup 2 summary effect size 

print (pes.subg.moderator[study label 2]) 

# Display subgroup analysis results 

print (subganal.moderator ) 

# Display recomputed summary effect size 

print (pes.moderator) 


# Option 2: the logit transformation 

# Conduct moderator analysis 

subganal.moderator <- rma(yi, vi, data = ies.logit, 
mods = ~ moderator) 

pes.subg.moderator <- predict (subganal.moderator , 
transf=transf.ilogit) 

# Obtain estimates for each subgroup 


pes.logit.subgroup1 <- rma(yi, vi, data 
mods = ~ moderator == "subgroup2") 


ies.logit, 


pes.logit.subgroup2 <- rma(yi, vi, data ies.logit, 
mods =~ moderator == "subgroup1i") 

# Create a dataframe to store effect size estimates, 

# standard errors, heterogeneity for both subgroups 

dat.samevar <- data.frame(estimate = 


c((pes.logit.subgroupi$b) [1] ,(pes.logit.subgroup2$b) [1]), 


stderror = 
c((pes.logit.subgroupi$se) [1] ,(pes.logit.subgroup2$se) [1]), 
tau2 = subganal.moderator$tau2) 


# Recalculate summary effect size assuming a common 
# heterogeneity component 
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pes.logit.moderator <- rma(estimate, sei = stderror, 
method = "FE", data = dat.samevar) 

pes.moderator <- predict(pes.logit.moderator, transf = 
transf.ilogit) 

# Display subgroup 1 summary effect size 

print (pes.subg.moderator[study lable 1]) 

# Display subgroup 2 summary effect size 

print (pes.subg.moderator[study lable 2]) 

# Display subgroup analysis results 

print (subganal. moderator) 

# Display recomputed summary effect size 

print (pes.moderator) 


# Option 3: the double arcsine transformation 

# Conduct moderator analysis 

# targ can also be set to list(ni = 1/(pes.da$se)~2) 

subganal.moderator <- rma(yi, vi, data = ies.da, mods 
= ~ moderator) 

pes.subg.moderator <- predict (subganal.moderator, 
transf = transf.ipft.hm, targ = list (ni=dat$total)) 

# Obtain estimates for each subgroup 

pes.da.subgroup1 <- rma(yi, vi, data 

moderator == "subgroup2") 
pes.da.subgroup2 <- rma(yi, vi, data 
moderator == "subgroupi") 

# Create a dataframe to store effect size estimates, 

# standard errors, heterogeneity for both subgroups 

dat.samevar <- data.frame(estimate = 
c((pes.da.subgroup1$b) [1], 
(pes.da.subgroup2$b) [1]), stderror = 
c((pes.da.subgroupi$se) [1], 
(pes.da.subgroup2$se) [1]), tau2 = 
subganal.moderator$tau2) 

# Recalculate summary effect size assuming a common 

# heterogeneity component 

# targ can also be set to list(ni = 1/(pes.da$se)~2) 

pes.da.moderator <- rma(estimate, sei = stderror, 
method = "FE", data = dat.samevar) 

pes.moderator <- predict(pes.da.moderator, transf = 
transf.ipft.hm, targ = list(ni = dat$total)) 

# Display subgroup 1 summary effect size 

print (pes.subg.moderator[study lable 1]) 

# Display subgroup 2 summary effect size 

print (pes.subg.moderator[study lable 2]) 

# Display subgroup analysis results 


ies.da, mods 


ies.da, mods 
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print (subganal. moderator) 
# Display recomputed summary effect size 
print (pes.moderator) 


To help readers better understand how to use the code templates, we will now 
illustrate their implementation with the running example. For demonstrative 
purposes, we will use the variable “study design” (Birth cohort vs. Others) as 
the moderator and conduct the analysis with the logit transformation under 
both assumptions. 


In the first situation, we do not assume a common between-study variance 
component across subgroups: 


# Conduct a random-effects model meta-analsis for each 

# subgroup defined by the moderator studydesign 

pes.logit.birthcohort <- rma(yi, vi, data=ies.logit, 
subset=studydesign == "Birth cohort", method="DL") 

pes.logit.others <- rma(yi, vi, data=ies.logit, 
subset=studydesign == "Others", method = "DL") 

pes.birthcohort <- predict(pes.logit.birthcohort , 
transf = transf.ilogit, digits = 5) 

pes.others <- predict(pes.logit.others, transf = 

transf.ilogit, digits = 5) 

Create a dataframe to store effect size estimates, 

standard errors, heterogeneity for both subgroups 

Add an object named studydesign to distinguish two 

subgroups. 

dat.diffvar <- data.frame(estimate = 
c(pes.logit.birthcohort$b, pes.logit.others$b) , 
stderror = c(pes.logit.birthcohort$se, 
pes.logit.others$se), studydesign = c("Birth 
cohort", "Others"), tau2 = 
round(c(pes.logit.birthcohort$tau2, 
pes.logit.others$tau2), 3)) 

# Fit a fixed-effect meta-regression to compare the 

# subgroups 


# 
# 
# 
# 


subganal.studydesign <- rma(estimate, sei = stderror, 
data = dat.diffvar, mods = ~ studydesign, method = 
"PE") 


# Recalculate summary effect size assuming different 
# heterogeneity components 


pes.logit.studydesign <- rma(estimate, sei = stderror, 
method = "FE", data = dat.diffvar) 

pes.studydesign <- predict (pes.logit.studydesign, 
transf = transf.ilogit) 


# Display summary effect sizes of the two subgroups 
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print(pes.birthcohort, digits = 6); 
print (pes.logit.birthcohort, digits = 3) 

print(pes.others, digits = 6); print(pes.logit.others, 
digits = 3) 

# Display subgroup analysis results 

print (subganal.studydesign, digits = 3) 

# Display recomputed summary effect size 

print (pes.studydesign, digits = 6) 


The outcomes of the subgroup analysis appear in Figure 13. 


pred ci. Eb ci.ub pi. 1b pi.ub 
0.000352 0.000158 0.000782 0.000045 0.002737 


Random-Effects Model (k = 6; tau^2 estimator: DL) 


tau^2 (estimated amount of total heterogeneity): 0.932 (SE = 0.866) 
tau (square root of estimated tau^2 value): 0.966 
I°2 (total heterogeneity / total variability): 98.55% 
H*2 (total variability / sampling variability): 68.92 
Test for Heterogeneity: 
Q(df = 5) = 344.594, p-val < .001 
Model Results: 
estimate se zval pval gi. Ib ci.ub 
-7.952 0.408 -19.501 <.001 -8.752 -7.153 *x* 
Signif. codes: O >****? 0.001 ’**? 0.01 ’*’? 0.05 >. O.1 ? ? 1 
pred ci.lb ci.ub pi.lb pi.ub 
0.000472 0.000341 0.000653 0.000169 0.001317 
Random-Effects Model (k = 11; tau^2 estimator: DL) 
tau"2 (estimated amount of total heterogeneity): 0.247 (SE = 0.175) 
tau (square root of estimated tau^2 value): 0.497 
I*2 (total heterogeneity / total variability): 95.76% 
H*2 (total variability / sampling variability): 23.59 
Test for Heterogeneity: 
Q(df = 10) = 235.944, p-val < .001 
Model Results: 
estimate se zval pval ei ..ib ci.ub 
-7.658 0.166 -46.161 <.001 -7.984 -7.333 *x* 
Signif. codes: O ’***? 0.001 ’**? 0.01 ’*’ 0.05 ’.’? 0.1 ? ? 1 


Fixed-Effects with Moderators Model (k = 2) 


I*2 (residual heterogeneity / unaccounted variability): 0.00% 
H*2 (unaccounted variability / sampling variability): 1.00 
R72 (amount of heterogeneity accounted for): NA% 


Test for Residual Heterogeneity: 
QE(df = 0) = 0.000, p-val = 1.000 
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Test of Moderators (coefficient 2): 
QM(df = 1) = 0.445, p-val = 0.505 


Model Results: 


estimate se zval pval ci.lb ci.ub 
intrcpt -7.952 0.408 -19.501 <.001 -8.752 -7.153 *** 
studydesignOthers 0.294 0.440 0.667 0.505 -0.569 1.157 
Signif. codes: 0 ’***? 0.001 ’**? 0.01 ?’°*? 0.05 ’.’ 0.1? ? 1 
pred ci Ib ci.ub 


0.000453 0.000335 0.000611 


Figure 13. A subgroup analysis assuming different between-study variance compo- 
nents 


From the output above, we can derive that the summary effect estimates 
are 0.00035 (95% CI = 0.00016, 0.00078), 0.00047 (95% CI = 0.00034, 0.00065), 
and 0.00045 (95% CI = 0.00034, 0.00061) for the two subgroups and the overall 
group of studies, respectively. Note that the subgroup summary effect estimates 
are derived by taking the exponential of the model results (e.g., exp(-7.952) = 
0.00035). When we fit separate random-effects models in the two subgroups, we 
decide to allow the amount of variance within each set of studies to be different, 
which results in two different within-group estimates of T? (0.93 and 0.25 for 
studies using the birth cohort design and other study designs, respectively). In 
other words, studies within each subgroup share the same estimate of 7? . 

The results reveal that the difference between the two subgroup summary 
estimates is not statistically significant (QM(1) = 0.45, p = 0.51). Note that 
the sum of the within-group heterogeneity across the subgroups in the fixed- 
effect model is equal to QE(0) = 0, p = 1. This is because the within-group 
heterogeneity has been accounted for in each subgroup (Q(df = 5) = 344.594, 
p < 0.001; Q(df = 10) = 235.944, p < 0.01, respectively) in the random-effects 
model, thus there is no heterogeneity left to be accounted for. 

In the second situation where we assume a common between-study variance 
component across subgroups, execute the following code: 


# Conduct a subgroup analysis based on studydesign 


subganal.studydesign <- rma(yi, vi, data = ies.logit, 
mods = ~ studydesign, method = "DL") 

pes.subg.studydesign <- predict (subganal.studydesign, 
transf = transf.ilogit) 

# Obtain estimates for each subgroup 
pes.logit.birthcohort <- rma(yi, vi, data = ies.logit, 
mods = ~ studydesign == "Others", method = "DL") 
pes.logit.others = rma(yi, vi, data = ies.logit, mods 


= ~ studydesign == "Birth cohort", method = "DL") 
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# Create a dataframe to store effect size estimates, 

# standard errors, heterogeneity for both subgroups 

dat.samevar <- data.frame(estimate = 
c((pes.logit.birthcohort$b) [1], 
(pes.logit.others$b) [1]), stderror = 
c((pes.logit.birthcohort$se) [1], 
(pes.logit.others$se)[1]), tau2 = 
subganal.studydesign$tau2) 

# Recalculate summary effect size assuming a common 

# heterogeneity component 


pes.logit.studydesign = rma(estimate, sei = stderror, 
method = "FE", data = dat.samevar) 

pes.studydesign = predict (pes.logit.studydesign, 
transf = transf.ilogit) 


# Display subgroup summary effect sizes 
print (pes.subg.studydesign[1], digits = 6) 
print (pes.subg.studydesign[17], digits = 6) 
# Display subgroup analysis results 

print (subganal.studydesign, digits = 4) 

# Display recomputed summary effect size 
print (pes.studydesign, digits = 6) 


The outcome of the subgroup analysis appears in Figure 14. This output is 
fairly self-explanatory. Based on this output, we can derive that we have fitted a 
mixed-effects model, meaning a random-effects model is used to combine studies 
within each subgroup and a fixed-effect model is used to combine the subgroups 
and estimate the summary effect size. The amount of within-group heterogeneity 
across the two subgroups is assumed to be the same (7? = 0.44 in this case). This 
combined estimate is derived by pooling the two within-group variance estimates 
as displayed earlier (T? = 0.93 and 7? = 0.25). Once we have the pooled estimate, 
we then apply it to each study across the two subgroups, meaning every study 


now shares the same estimate of 7? (i.e., 0.44). 
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Mixed-Effects Model (k = 17; tau^2 estimator: DL) 


tau^2 (estimated amount of residual heterogeneity): 0.4427 (SE = 0.2518) 
tau (square root of estimated tau^2 value): 0.6654 
I°2 (residual heterogeneity / unaccounted variability): 97.42% 
H*2 (unaccounted variability / sampling variability): 38.70 
R*2 (amount of heterogeneity accounted for): 0.00% 


Test for Residual Heterogeneity: 
QE(df = 15) = 580.5386, p-val < .0001 


Test of Moderators (coefficient 2): 
QM(df = 1) = 0.9202, p-val = 0.3374 


Model Results: 


estimate se zval pval ci.lb ci.ub 
intrcpt -7.9742 0.2892 -27.5726 <.0001 -8.5411 -7.4074 *** 
studydesignOthers 0.3452 0.3599 0.9593 0.3374 -0.3601 1.0506 
Signif. codes: 0O ’***? 0.001 ’**? 0.01 ’?*’ 0.05 ’.’? 0.1 ? ? 1 
pred ci.lb ci.ub pi.lb pi.ub 


1 0.000344 0.000195 0.000606 0.000083 0.001425 


pred ci.lb ci.ub pi.lb pi.ub 
17 0.000486 0.000319 0.000739 0.000124 0.001910 


pred ci.lb ci.ub 
0.000430 0.000307 0.000602 


Figure 14. A subgroup analysis assuming a common between-study variance compo- 
nent 


The test of moderators suggests that the study design does not have a mod- 
erating effect (QM(1) = 0.92, p = 0.34). That is, when we divide the included 
studies according to their study designs, we fail to find any significant differences 
between the two subgroups of effect sizes. This conclusion is also supported by 
the results of the test for residual heterogeneity: there is significant unexplained 
heterogeneity left in the effect sizes (QE(15) = 580.54, p < 0.01), which can also 
explain why R? shows 0%. Finally, the estimates for the two subgroup summary 
proportions and the overall summary proportion are displayed at the bottom of 
the output. They are 0.00034 (95% CI = 0.0002, 0.00061), 0.00049 (95% CI = 
0.00032, 0.00074), and 0.00043 (95% CI = 0.00031, 0.0006), respectively. 

There are several other points that are worth noting. Under the framework 
of the mixed-effect model, the residual heterogeneity estimate here (QE(15) = 
580.54) is the sum of the two within-group heterogeneity estimates we have 
obtained above in the random-effects model (Q(df = 5) = 344.59, Q(df = 10) 
= 235.94, respectively). When we dummy-code a moderator with two categories, 
the subset of studies coded as 0 in a dummy variable will function as the reference 
group, represented by the intercept of the fitted mixed-effects regression model. 
The other subset of studies coded as 1 will be compared against the reference 
group. In the running example, the “Birth cohort” group is the reference group, 
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while the “Others” group is compared against it. The estimate of the intercept 
(i.e., -7.97) is the logit-transformed summary effect size of the reference group 
(i.e, logit(0.00034)). The slope is estimated to be 0.35. The sum of the slope and 
the intercept is equal to -7.629, which is the logit-transformed summary effect 
size of the “Others” group (i.e., logit(0.00049)). 

When calculating the summary effect estimate across the subgroups, the out- 
comes may vary depending on the specific T? estimate applied. However, even 
with this variation, the two computational models may reach the same qualita- 
tive conclusions. For instance, in the given example, both models agree that the 
study design doesn’t significantly influence the results. In general, Borenstein et 
al. (2009) recommend pooling the separate 7? when the number of studies in a 
subgroup is small (i.e., less than five studies). In doing so, we can obtain a more 
accurate estimate of T. In contrast, if we decide not to pool them, each sub- 
group should ideally consist of at least five studies to ensure moderately stable 
estimates of 7°. 


7.4 Creating forest plots in the presence of subgroups in R 


Many authors conducting meta-analyses of proportions did not construct for- 
est plots correctly for their subgroup analyses. Specifically, numerous published 
meta-analytic studies did not present the appropriate estimates for either the 
overall or subgroup summary proportions in their forest plots. These authors 
failed to consider the two possible assumptions about 7? that we have discussed 
in Section 7.3. 

In this section, we will construct forest plots with subgroups under different 
assumptions (i.e., separate between-study variance components vs. a common 
between-study variance component). We have obtained the estimates for sub- 
group and overall summary proportions in the previous section, which can be 
used to create our forest plots. The following code is used to construct forest 
plots under the first assumption: 


# Assumption 1: Do not assume a common between-study 
# variance component (use separate within-group 
# estimates of between-study variance). 


# Option 1: no transformation 


ies.summary <- summary(ies, ni = dat$total) 
forest (ies.summary$yi, ci.lb = ies.summary$ci.1b, 
ci.ub = ies.summary$ci.ub, rows = c(d:c, b:a)) 


# Option 2: the logit transformation 

ies.summary <- summary(ies.logit, transf = 
transf.ilogit) 

forest (ies.summary$yi, ci.lb = ies.summary$ci.1b, 
ci.ub = ies.summary$ci.ub, rows = c(d:c, b:a)) 
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# Option 3: the double arcsine transformation 

ies.summary <- summary(ies, transf = transf.ipft, ni = 
dat$total) 

forest (ies.summary$yi, 


ci.lb = ies.summary$ci.1b, 
ci.ub = ies.summary$ci.ub, 
rows = c(d:c, b:a)) 


The code above merely builds the “bones” of a forest plot. More components 
need to be added to it (e.g., texts, headers, labels, etc.). We also have to man- 
ually adjust its appearance to make it look more professional. Dividing a set 
of included studies into several subgroups in a forest plot using metafor has to 
be done manually with the rows argument. Readers may have noticed that the 
parameters in the argument (a, b, c, and d denotes a particular position on the 
Y-axis) are ordered from right to left. a specifies the vertical position for plotting 
the first study in the first subgroup; b specifies the vertical position for plotting 
the last study in the first subgroup; c specifies the vertical position for plotting 
the first study in the second subgroup; d specifies the vertical position for plot- 
ting the last study in the second subgroup. Mathematically speaking, b— a+ 1 
and d — c + 1 should be equal to the number of studies in their corresponding 
subgroups. c and b do not need to be consecutive numbers. If we order these 
parameters from left to right, studies will be displayed in reverse order with the 
first study being displayed at the bottom of the plot and the last study being 
displayed at the top of all the studies. 

To illustrate, we can execute the following code to create a forest plot using 
the study design as the moderator: 


# Run the subgroup analysis code with the assumption 

# of separate within-group estimates of between-study 

# variance components first, then run the following 

# code 

ies.summary <- summary(ies.logit, transf = 
transf.ilogit) 

# par() function specifies font parameters 

par(cex = 1, font = 6) 

# Set up forest plot 

# order= argument ensures that studies are divided by 

# the subgroup variable 

forest (ies.summary$yi, 


order = ies.summary$studesg, 
ci.lb = ies.summary$ci.1b, 
ci.ub = ies.summary$ci.ub, 


ylim = c(-5, 23), 

xlim = c(-0.005, 0.005), 

slab = paste(dat$author, dat$year, sep = ","), 
ilab = cbind(data = dat$cases, dat$total), 
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ilab.xpos = c(-0.0019, -0.0005), 
ilab.pos = 2, 
rows = c(19:14, 8.5:-1.5), 
at = c(seq(from = 0, to = 0.004, by = 0.001)), 
refline = pes.studydesign$pred, 
main = "" 
xlab = "Proportion", 
digits = 4) 


# Add summary 
# proportions 


polygons for the subgroup and overall 


par(cex = 1.2, font = 7) 

addpoly(pes.birthcohort$pred, ci.lb = 
pes. birthcohort$ci.lb, ci.ub = 
pes.birthcohort$ci.ub, row = 12.8, digits = 5) 

addpoly(pes.others$pred, ci.lb = pes.others$ci.1b, 
ci.ub = pes.others$ci.ub, row = -2.7, digits = 5) 

addpoly (pes.studydesign$pred, ci.lb = 
pes.studydesign$ci.lb, ci.ub = 
pes.studydesign$ci.ub, row = -4.6, digits = 5) 

# Add column headings to the plot 

par(cex = 1.1, font = 7) 

text(-0.005, 21.8, pos = 4, "Study") 

text(c(-0.0026, -0.0014), 21.8, pos = 4, c("Cases", 
"Total")) 

text (0.0025, 21.8, pos 4, "Proportion [95% CI]") 

# Add text for the subgroups 

text (-0.005, c(9.7, 20.2), pos = 4, c("Others", "Birth 
cohort") ) 

# Add text for the subgroup and overall proportions 

par(cex = 1, font = 7) 

text (-0.005, -4.6, pos = 4, c("Qverall proportion") ) 

text (-0.005, 12.8, pos = 4, c("Subgroup proportion") ) 

text (-0.005, -2.7, pos = 4, c("Subgroup proportion") ) 

abline(h = -3.7) 


The generated forest plot is shown in Figure 15. Notice that the overall 
summary proportion is 0.00045 (95% CI = 0.00033, 0.00061) under the given 
assumption, which is different than the one derived in the absence of subgroups 
(0.00042). 
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Study Cases Total Proportion [95% Cl] 
Birth cohort 
Stewart-Brown,1988 7 12853 — 0.0005 [0.0003, 0.0011] 
SanGiovanni,2002 73 53639 n) 0.0014 [0.0011, 0.0017] 
Stayte,1993 4 6687 e 0.0006 [0.0002, 0.0016] 
Haargaard,2004 773 2616439 a 0.0003 [0.0003, 0.0003] 
Bermejo,1998 71 1124654 E 0.0001 [0.0001, 0.0001] 
Stoll,1997 57 212479 HH 0.0003 [0.0002, 0.0003] 
Subgroup proportion ae 0.00035 [0.00016, 0.00078] 
Others 
Rahi,2001 248 734000 m 0.0003 [0.0003, 0.0004] 
Wirth,2002 421 1870000 L] 0.0002 [0.0002, 0.0002] 
Hu,1987 77 207319 H 0.0004 [0.0003, 0.0005] 
Abrahamsson, 1999 136 377334 m 0.0004 [0.0003, 0.0004] 
Bhatti,2003 199 982128 . 0.0002 [0.0002, 0.0002] 
Halilbasic,2014 51 38133 c 0.0013 [0.0010, 0.0018] 
Chen,2014 6 9246 H+ 0.0006 [0.0003, 0.0014] 
Yang,2014 8 6299 c 0.0013 [0.0006, 0.0025] 
Pi,2012 3 3079 H 10.0010 [0.0003, 0.0030] 
Holmes,2003 10 33021 = 0.0003 [0.0002, 0.0006] 
Nie,2008 15 15398 e 0.0010 [0.0006, 0.0016] 
Subgroup proportion > 0.00047 [0.00034, 0.00065] 
Overall proportion > 0.00043 [0.00031 , 0.00060] 
T 
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Figure 15. A forest plot with subgroups assuming different 7? generated by metafor 


Under the assumption of a common 7°, we employ the rma() function in 


metafor in conjunction with the metaprop() and forest() functions in meta to 
produce a forest plot with subgroups. The inclusion of predictors is set by the 
mods argument in metafor and the byvar argument in meta. In the metaprop() 
function, two arguments are particularly noteworthy: tau.common determines 
whether a common 7? estimate is applied across subgroups, while tau.preset 
sets the value of 7. Given our assumption, we set tau.common to TRUE and 
tau.preset to the pooled 7 estimate obtained from the previous section. 


# Assumption 2: Assume a common between-study variance 

# component (pooling within-group estimates of 

# between-study variance) 

# data= could also be set to ies.logit or ies.da 

subganal.moderator <- rma(yi, vi, data = ies, mods = 
moderator, method = "DL") 

# sm= could also be set to "PLO" or "PFT" 

# tau.common= must be TRUE and tau.preset must be 

# sqrt (subganal.moderator$tau2) 
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pes.summary <- metaprop(cases, total, authoryear, data 
= dat, sm = "PRAW", byvar = moderator, 
tau.common=TRUE, tau.preset = 
sqrt (subganal.moderator$tau2) ) 

# resid.hetstat= must be FALSE 

forest (pes.summary, resid.hetstat = FALSE) 


Assuming that we apply a common 7? across subgroups, the following code 
creates a customized forest plot using the study design as the moderator: 


subganal.studydesign <- rma(yi, vi, data = ies.logit, 
mods = ~ studydesign, method = "DL") 

pes.summary <- metaprop(cases, total, authoryear, data 
= dat, sm = "PLO", method.tau = "DL", method.ci = 


"NAsm", byvar = studydesign, tau.common=TRUE, 
tau.preset = sqrt (subganal.studydesign$tau2) ) 
forest (pes.summary, 

common = FALSE, 

overall = TRUE, 

overall.hetstat = TRUE, 

resid.hetstat = FALSE, 

subgroup.hetstat = TRUE, 

test.subgroup = FALSE, 

fs.hetstat = 10, 

print.tau2 = TRUE, 

print.Q = TRUE, 

print.pval.Q = TRUE, 

print.I2 = TRUE, 

rightcols = FALSE, 

xlim = c(O ,4), 


leftcols = c("studlab", "effect", "ci"), 
leftlabs = c("Study", "Proportion", "95% C.1I."), 
text.random.w = "Subgroup proportion", 
text.random = "Overall proportion", 

xlab = "Prevalence of CC (%)", 

pscale = 1000, 

smlab = " ", 

weight.study = "random", 

squaresize = 0.5, 

col.square = "navy", 

col.diamond = "maroon", 
col.diamond.lines = "maroon", 


digits = 2) 


The generate forest plot is presented in Figure 16: 
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Study Proportion 95% C.l. 
Stewart-Brown 1988 0.54 [0.26; 1.14] —-E— 
SanGiovanni 2002 1.36 [1.08; 1.71 + 
Stayte 1993 0.60 [0.22; 1.59] —#——— 
Haargaard 2004 0.30 [0.28;0.32] @ 
Bermejo 1998 0.06 [0.05; 0.08] m 
Stoll 1997 0.27 [0.21; 0.35 
=_> 
Rahi 2001 0.34 [0.30; 0.38 
Wirth 2002 0.23 [0.20;0.25] @ 
Hu 1987 0.37 [0.30; 0.46] E 
Abrahamsson 1999 0.36 [0.30; 0.43 
Bhatti 2003 0.20 [0.18; 0.23] B 
Halilbasic 2014 1.34 [1.02; 1.76 =E 
Chen 2014 0.65 [0.29; 1.44] —#—— 
Yang 2014 1.27 [0.64; 2.54 me 
Pi 2012 0.97 [0.31; 3.02] §——#—____ 
Holmes 2003 0.30 [0.16; 0.56] m- 
Nie 2008 0.97 [0.59; 1.62 = 
_ 
Overall proportion 0.43 [0.31; 0.60] > 
Heterogeneity: 1° = 97%, 1° = 0.4427, 7%, = 580.54 (p < 0.01) T T T l 
0 1 2 3 4 


Prevalence of CC (%o) 


Figure 16. A forest plot with subgroups assuming a common T° 


Notice that the estimates of 7? are identical (0.4427) across two subgroups. 
The overall summary proportion and its 95% CI (0.43; 95% CI = 0.31, 0.6) are 
calculated across two subgroups based on the same 7°? estimate, as well. 


7.5 Conducting meta-regression with different types of predictors 
in R 


When we want to evaluate the influence of a continuous moderator, the R code 
is identical to what we used for subgroup analyses: 


#data= could also be set to ies.logit or ies.da 
metareg.moderator <- rma(yi, vi, data = ies, mods = 
moderator) 


As mentioned above, a mix of continuous and categorical moderators can be 
regressed on the effect sizes in a meta-regression model. This can be achieved by 
using the plus sign in the mods argument: 


#data= could also be set to ies.logit or ies.da 
metareg.moderators <- rma(yi, vi, data = ies, mods = 
~moderatorA + moderatorB + moderatorC + ...) 
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7.6 Visualizing moderator analyses with scatter plots in R 


Scatter plots serve as an invaluable visualization tool when assessing potential 
moderator variables. Such plots, as depicted in Figure 17, are constructed with 
a regression line, flanked by two curved dotted lines that represent the 95% con- 
fidence interval bounds, with studies represented by circles drawn proportional 
to their study weights (i.e., larger studies appear as larger circles). What’s im- 
portant in scatter plots is the slope of the regression line. Specifically, if the 
regression line is horizontal or nearly so, it suggests there’s no significant asso- 
ciation between the moderator and the effect sizes. Conversely, if the regression 
line has a noticeable slope, it indicates the effect sizes change in relation to the 
value of the moderator. To determine the significance of this relationship, one 
can look at the slope and its significance test. A notably positive or negative 
slope indicates that the predictor plays a significant moderating role, potentially 
explaining a significant portion of the observed heterogeneity. 
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Figure 17. A basic scatter plot 


In this section, we will employ the regplot() function in the metafor package 
to create scatter plots. regplot() offers a distinct advantage over R’s native plot() 
function. It simplifies the coding process, making it more user-friendly, especially 
for those less familiar with R. It helps users to customize their scatter plots with 
ease. 

The following generic code creates weighted scatter plots for subgroup anal- 
yses. In a weighted scatter plot, a study is represented by a circle. The weight 
of a study is depicted by the size of the circle, with a larger circle indicating a 
greater study weight. In an unweighted scatter plot, the circles are of equal size. 
Additionally, it is necessary to use dummy variables for categorical moderators 
(e.g., variables labeled as “studesg” in the running example). 
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# Option 1: 


# Option 2: 


transf=transf.ilogit) 


# Option 3: 


no transformation 
regplot (subganal.dummyvar, mod = "dummyvar") 


the logit transformation 
regplot(metareg.dummyvar, mod = "dummyvar", 


the double arcsine transformation 
# targ can also be set to list(ni 
regplot (subganal.dummyvar, mod = "dummyvar", 

transf=transf.ipft.hn, 


1/(pes.da$se) ~2) 


targ=list (ni=dat$total) ) 


Using the running example, we can create a customized scatter plot with a re- 
gression line and corresponding 95% CI bounds for “studesg” with the following 


code: 


# Conduct a subgroup 


# variable "studesg" 
subganal.studesg=rma (yi, vi, 


studesg, method = "DL") 


# Create a scatter plot 


regplot (subganal.studesg, mod 
xlab = "Study Design", 


transf=transf.ilogit, 
legend = FALSE, 


label = TRUE, 

shade = "white", 

bg = "transparent", 
lcol = "navy", 


digits = 4) 


analysis based on the dummy 


ies.logit, 


"studesg", 


The generated scatter plot is shown in Figure 18. 
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Figure 18. A scatter plot using the study design as the moderator 


Upon visual inspection of the scatter plot, it is evident that the slope of 


the estimated regression line is neither entirely horizontal nor excessively steep, 
suggesting a weak association between the study design and the observed ef- 
fects. Furthermore, nearly half of the studies fall outside of the 95% CI bounds, 
indicating the presence of potentially unidentified moderators.’ 


In the second example, we use the sample size as the moderator (the variable 


“size” in the provided data set) and evaluate it in a subgroup analysis: 


subganal.size <- rma(yi, vi, 
size, method = "DL") 
regplot (subganal.size, 


mod = "size", 
transf=transf.ilogit, 
xlab = "Sample size", 
legend = "topright", 
label = TRUE, 

shade = "white", 

bg = "transparent", 
lcol = "navy", 


digits = 6) 


data = ies.logit, mods = 


T If one wants to change the curved slope and 95% CIs lines to straight lines, further 
steps are needed in R. I’ve included relevant R code in the supplementary materials. 
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The generated scatter plot is presented in Figure 19. The code is self-explanatory. 
Note that the legend argument determines if a legend is added to the scatter 
plot, with its location specified by the user. 
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Figure 19. A scatter plot using sample size as the moderator 


In this case, the estimated regression line exhibits a noticeably steeper slope. 
A visual inspection of this scatter plot indicates a negative correlation between 
the sample size and the observed proportions. When the sample size is less than 
100,000, the proportions tend to be higher; when the sample size is larger than 
100,000, the proportions tend to be lower. Again, it is important to acknowledge 
that potential missing moderators may introduce a degree of omitted variable 
bias here. The outcomes of the subgroup analysis are shown below in Figure 20. 
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Mixed-Effects Model (k = 17; tau^2 estimator: DL) 


tau^2 (estimated amount of residual heterogeneity): 0.1398 (SE = 0.0911) 
tau (square root of estimated tau~2 value): 0.3739 

I°2 (residual heterogeneity / unaccounted variability): 93.90% 

H*2 (unaccounted variability / sampling variability): 16.40 

R72 (amount of heterogeneity accounted for): 57.07% 


Test for Residual Heterogeneity: 
QE(df = 15) = 246.0073, p-val < .0001 


Test of Moderators (coefficient 2): 
QM(df = 1) = 36.4266, p-val < .0001 


Model Results: 


estimate se zval pval ci.lb ci.ub 
intrcpt -7.0500 0.1643 -42.9109 <.0001 -7.3720 -6.7280 *x** 
size -1.2867 0.2132 -6.0354 <.0001 -1.7046 -0.8689 *** 
Signif. codes: 0 ’***? 0.001 ’**? 0.01 ’*? 0.05 ’.’? 0.1? ? 1 


Figure 20. A subgroup analysis for sample size 


The results of the test of moderators (QM (1) = 36.43, p < 0.0001) as well 
as the significant regression coefficient (—1.29; Z (15) = —6.04, p < 0.0001) are 
consistent with our visual interpretation. In stark contrast with the previous 
subgroup analysis, the R? indicates that 57.07% of the true heterogeneity in the 
observed effect size can be explained by the sample size. 

In the running example, Wu et al. (2012) did not examine any continuous 
predictors. To demonstrate how to generate a weighted scatter plot for a meta- 
regression with a continuous predictor in R, we will plot the observed effect 
sizes against the year of publication, represented by the “year” variable in the 
provided dataset. The code is provided below: 


metareg.year <- rma(yi, vi, data = ies.logit, mods = ~ 
year, method = "DL") 
regplot (metareg.year, 
mod = "year", 
transf = transf.ilogit, 
xlab = "Year of publication", 
legend = "topleft", 
label = TRUE, 
shade = "white", 
bg = "white", 
lcol = "navy", 


digits = 6) 


The generated scatter plot is presented in Figure 21. 
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Figure 21. A scatter plot using the publication year as the moderator 


8 Common procedures addressing publication bias do 
not apply to meta-analyses of proportions 


One of the major threats to the validity of meta-analysis is publication bias. 
This is a phenomenon where journals tend to accept and publish a study de- 
pending on the direction or strength of its results (MarksAnglin & Chen, 2020). 
Compared with studies with statistically significant results, small studies re- 
porting insignificant results or small effects are less likely to be published and 
subsequently included in a meta analysis (Dickersin, 1990; Littell et al., 2008). 
Omitting unpublished studies in a systematic review could lead to a biased 
meta-analytic estimate of the summary effect (Song, Eastwood, Gilbody, Duley, 
& Sutton, 2000). As smaller studies require larger effect sizes to achieve statis- 
tical significance (Sterne, Gavaghan, & Egger, 2000), only those small studies 
with large effects get published and included in a relevant meta-analysis. Thus, 
a meta-analysis that only includes studies with large effects and fails to include 
studies with small effects at the same time could overestimate the true effect 
(Cuijpers, 2016). 

Current methods of detecting publication bias and assessing its impact are 
developed for meta-analyses of randomized control trials. These methods rely 
on certain assumptions (Borenstein et al., 2009). Firstly, regardless of the sig- 
nificance of their effects, large studies are most likely to be published. Secondly, 
only small studies demonstrating significant and substantial effects tend to be 
published. Lastly, most moderate-scale studies that yield significant results also 
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tend to be published. Consequently, as the sample size of a study decreases, the 
likelihood of it being affected by publication bias increases. Traditional meth- 
ods such as trim-and-fill, the rank correlation test, Egger’s regression model, 
as well as the more sophisticated weighted selection approaches (e.g., Vevea & 
Hedges, 1995; Vevea & Woods, 2005) have all operated under the assumption 
that the publication likelihood depends on sample size, statistical significance, 
or the direction of results (Coburn & Vevea, 2015). 

While empirical research has confirmed the dominant role of statistical sig- 
nificance in study publication (Preston, Ashby, & Smyth, 2004), the actual pub- 
lication selection process across different fields is much more intricate. Cooper, 
DeNeve, and Charlton (1997) have demonstrated that decisions regarding study 
publication are influenced by various criteria or “filters” set by journal editors 
and reviewers, independent of methodological quality and significance. These 
filters can include factors such as research funding sources, societal preferences 
related to race and gender during the study’s conduction, and even findings that 
challenge pre-existing beliefs. Consequently, the traditional methods may fail to 
capture the full complexity of the publication selection process. 

In practice, authors of meta-analyses of proportions have employed these 
methods in their attempts to detect publication bias. However, studies included 
in meta-analyses of proportions are observational and non-comparative. In other 
words, they only report a proportion or prevalence of an event, which inherently 
precludes the testing of statistical significance for their findings (Borenstein, 
2019). Consequently, the interpretation of the outcomes from such studies is not 
contingent on the null hypothesis significance test and thus cannot be catego- 
rized as either “positive/negative” or “desirable/undesirable.” The significance 
levels are, therefore, unlikely to influence publication decisions regarding these 
studies (Maulik, Mascarenhas, Mathers, Dua, & Saxena, 2011). Authors who re- 
port low proportions (e.g., rare event rates) are equally likely to have their work 
published as those reporting very high proportions (e.g., high cure rates), given 
that the study quality meets rigorous publication standards. Consequently, the 
traditional publication bias assessment procedures may struggle to identify pub- 
lication bias in meta-analyses of proportions, as bias in non-comparative studies 
can be introduced for reasons unrelated to statistical significance. 

Borenstein (2019) warns meta-analysts that it is a mistake to apply publi- 
cation bias procedures to studies of prevalence. Our suggestion aligns with his. 
When conducting meta-analyses of proportions, we believe that the traditional 
publication bias tests and modeling tools developed for randomized controlled 
trials have limited utility and, therefore, should not be used. Any conclusions 
drawn regarding the presence of publication bias based on these methods should 
be approached with caution. 
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